File: docs/proposals/post-deployment-hooks.md (new file, 352 additions)
# Deployment Hooks

## Abstract

A proposal for an API to execute arbitrary code at defined points in the deployment lifecycle.


## Motivation

Deployment hooks are needed to provide users with a way to execute arbitrary commands necessary to complete a deployment.

Goals of this design:

1. Identify deployment hook use cases
2. Define the integration of deployment hooks and the deployment lifecycle
3. Describe a deployment hook API


## Comparison of potential approaches

> **Reviewer:** I would make a note here saying that fundamentally, these are the choices you have to evaluate the use cases in terms of, since this ties into the next section.

> **Author:** I revised this section, PTAL.


There are two fundamental approaches to solving each deployment hook use case: existing upstream support for *container lifecycle hooks*, and the externalized *deployment hooks* outlined in this proposal. The following describes the two approaches, and each use case is evaluated in terms of these approaches.

##### Upstream container lifecycle hooks

Kubernetes provides *container lifecycle hooks* for containers within a Pod. Currently, post-start and pre-stop hooks are supported. For deployments, post-start is the most relevant. Because these hooks are implemented by the Kubelet, the post-start hook provides some unique guarantees:

1. The hook is executed synchronously during the pod lifecycle.
2. The status of the pod is linked to the status and outcome of the hook execution.
    1. The pod will not enter a ready status until the hook has completed successfully.
    2. Service endpoints will not be created for the pod until the pod has a ready status.
    3. If the hook fails, the pod's creation is considered a failure, and the retry behavior is restart-policy driven in the usual way.

Because deployments are represented as replication controllers, a lifecycle hook defined for a container is executed in every replica of the deployment. This behavior has complexity implications when applied to deployment use cases:

1. The hooks for all pods in the deployment will race, placing a burden on hook authors (e.g., the hooks would generally need to tolerate concurrent execution and implement manual coordination).
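
For reference, here is a minimal sketch of an upstream post-start container lifecycle hook as it would appear in a container definition. The field names follow the upstream Kubernetes API (the exact v1beta1 spelling may differ), and the image and command are illustrative:

```json
{
  "name": "rails",
  "image": "example/rails",
  "lifecycle": {
    "postStart": {
      "exec": {
        "command": ["rake", "db:migrate"]
      }
    }
  }
}
```

Because the hook rides along with the container definition, every replica runs it, which is exactly the racing behavior described above.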


##### Deployment hooks

An alternative to the upstream-provided lifecycle hooks is to have a notion of a hook which is a property of an OpenShift deployment. OpenShift deployment hooks can provide a different set of guarantees:

1. Hooks can be bound to the logical deployment lifecycle, enabling hook executions decoupled from replication mechanics.
    1. Races can be avoided by defining a hook that executes at least once per deployment regardless of the replica count.
2. Hooks defined in terms of deployments are conceptually easier to reason about from the perspective of a user defining a deployment workflow.

Hooks can be defined to execute before or after the deployment strategy scales up the deployment. When implementing a hook which runs after a deployment has been scaled up, there are special considerations to make:

1. Nothing prevents external connectivity races: the deployment's pods become routable to services and to other pods within an application the moment their containers enter a ready state, likely before or during hook execution.
2. Hook execution can't be atomically linked to the deployment pods' statuses: if a hook failure should result in deployment failure, an already scaled-up and exposed application must be rolled back, when ideally the application would not have been exposed before the hook succeeded.

> **Reviewer:** Why can't we wait to create the new deployment until the hook is successful (assuming the hook is configured to run before the deployment)? Same question for #1 above. Does this only apply to hooks that need to run after the pod is created (but ideally before it is exposed to the world)?

> **Author:** These considerations are qualified as being applicable only "when implementing a hook which runs after a deployment has been scaled up," which is one of two choices for a hook author. I present the two different types of hooks later on in "Proposed design." Is there a way I could make this more clear?

> **Author:** I made an attempt to clarify this, PTAL.

> **Reviewer:** So are there three scenarios/options here?
>
> 1. Run my hook before the deployment (and potentially abort the deployment if the hook fails).
> 2. Run my hook exactly once, after the deployment (challenges with the hook not completing before the deployment is exposed, or the hook attempting to run before the deployment is truly up).
> 3. Run my hook on every pod create event, implicitly including scale-up events (the existing k8s post-start functionality).
>
> And would the issues listed here apply to both item 2 and item 3?

> **Author:** On 1, basically - but a subtle clarification: the deployment would exist (with a replica count of 0) but wouldn't be scaled (the strategy wouldn't run) until the hook completed. On 2, yes: "after the deployment" means the strategy could have scaled up the deployment, and now it could be either partially or fully deployed and exposed at any time. On 3, yes. The issues listed in this specific part apply only to approach 2; approach 3 has its own issues, detailed in the section above dedicated to it.



## Use cases

1. As a Rails application developer, I want to perform a Rails database migration following an application deployment.
2. As an application developer, I want to invoke a cloud API endpoint to notify it of the presence of new code.


#### Use-case: Rails migrations

New revisions of a Rails application often contain schema or other database updates which must accompany the new code deployment. Users should be able to specify a hook which performs a Rails migration as part of the application code deployment.

Database migrations are complex and introduce downtime concerns. Here are [some examples](https://blog.rainforestqa.com/2014-06-27-zero-downtime-database-migrations) of zero-downtime Rails migration workflows.

Deployments including database migrations must make special considerations:

1. Code must be newer than the schema in use, or all old code must be stopped before the new schema is introduced.
2. Database or table locking must be minimized or eliminated to prevent service outages.

The workflows which are effective at ensuring zero downtime migrations are typically multi-phased. For a user orchestrating a zero downtime migration deployment, it's likely the user needs to verify each deployment step discretely, with the option to abort and rollback after each phase.

> **Reviewer:** I think an example of the phases of a deployment would add a lot to this.

> **Author:** Check it out.


Consider this simple example of a phased deployment which adds a new column:

1. Deploy a new migration which adds the new column.
    1. The user can verify that the new column didn't break the application.
2. Deploy a new version of the code which makes use of the new column.
    1. The user can verify the new code interacts correctly with the new column.

###### Container lifecycle hooks

Container lifecycle hooks introduce problems with Rails migrations:

1. There is no way to guarantee that pods with older code are not running.
2. The migration hook will execute in the same pod as application containers, consuming resources allocated for the application.
    1. This can cause instability, as it's unlikely the application pod resource allocation takes into account the temporarily increased requirements of a transient deployment step.

###### Deployment hooks

Deployment hooks satisfy this use case by providing a means to execute the hook only once per logical deployment. The hook is expressed as a run-once pod which provides the migration with its own resource allocation decoupled from the application.
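
As a minimal sketch, the migration hook could be declared with the `lifecycle` API proposed later in this document (the complete configuration appears in the Rails example at the end of this proposal):

```json
"lifecycle": {
  "pre": {
    "execNewPod": {
      "containerName": "rails",
      "command": ["rake", "db:migrate"]
    },
    "failurePolicy": "Retry"
  }
}
```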

> **Reviewer:** Again, an example would really help. This paragraph isn't very convincing without one.

> **Author:** Added a little example. Went with the 2-phase rather than 5 to illustrate the point.



#### Use-case: Invoke a cloud API endpoint

Consider an application whose deployment should result in a cloud API call being invoked to notify it of the newly deployed code.

###### Container lifecycle hooks

Container lifecycle hooks aren't ideal for this use case because they will be fired once per pod in the deployment during scale-up rather than following the logical deployment as a whole. Consider an example deployment flow using container lifecycle hooks:

1. Deployment is created.
2. Deployment is scaled up to 10 by the deployment strategy.
3. The cloud API is invoked 10 times.
4. Deployment is considered complete concurrently with the cloud API calls.
5. Deployment is scaled up to 15 to handle increased application traffic.
6. The cloud API is invoked 5 times, outside the deployment workflow.

###### Deployment hooks

A post-deployment hook would satisfy the use case by ensuring the API call is invoked after the deployment has been rolled out. For example, the flow of this deployment would be:

1. Deployment is created.
2. Deployment is scaled up to 10 by the deployment strategy.
3. Deployment hook fires, invoking the cloud API.
4. Deployment is considered complete.
5. Deployment is scaled up to 15 to handle increased application traffic.
6. No further calls to cloud API are made until next deployment.
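
A sketch of such a post-deployment hook using the API proposed below. The endpoint URL is hypothetical, the hook assumes the referenced container image provides `curl`, and `failurePolicy` is `Continue` because, as discussed below, post-deployment hooks cannot abort a deployment:

```json
"lifecycle": {
  "post": {
    "execNewPod": {
      "containerName": "app",
      "command": ["curl", "-X", "POST", "https://cloud.example.com/v1/deployments/notify"]
    },
    "failurePolicy": "Continue"
  }
}
```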


## Proposed design

Deployment hooks are implemented as run-once pods which can be executed at one or both of the following points during the deployment lifecycle:

1. Before the execution of the deployment strategy, which ensures that the hook is run and its outcome evaluated prior to the scale-up of the deployment. These hooks are referred to as *pre-deployment* hooks.
2. After the execution of the deployment strategy. These hooks are referred to as *post-deployment* hooks.

> **Reviewer:** Why wouldn't I solve this with a custom deployment process? Give a counterargument to that (complexity, having to maintain my own image). Then also explore letting a simple script be injected into the deployment process. Give counterarguments to that (it has coupling to our deployer image, even though it makes writing a custom deployment easier).
>
> As discussed on IRC, there seem to be three levels of complexity:
>
> 1. Run a run-once pod during the deployment process using the same image and an executable in the image.
> 2. Provide a snippet of bash to run inside the deployment process pod.
> 3. Provide a custom deployment image.

> **Author:** (This reply is meant to address the adjacent comment as well as two related comments above.) My thinking was that if your hook code was totally decoupled from your deployment image, you go down a path where you end up needing to capture all the pod template fields to run the hook anyway (first it's the command, then maybe you also want to inject special environment into your hook, you also might want to set the restart policy if it's retryable, etc.). And we already have an API for that input via PodTemplate. I agree it might be overkill, but we'll have to draw some line about what subset of fields we'd accept to ultimately jam into a PodTemplate.
>
> What I have described currently could be 0 in your list: run a run-once pod during the deployment process using an *arbitrary* image and an executable in the image.
>
> If we're willing to narrow the capabilities to your 1-3 list, we can come up with a more special-purpose and concise API. So, do we scrap 0 and move on to figuring out how to provide the API for 1 and 2?

> **Author:** After talking with @danmcp and @abhgupta, we seemed to agree the following inputs might be satisfactory for most needs:
>
> 1. A command to run (as a string array)
> 2. A named reference to a container within the deployment pod template
>
> For the hook execution, a pod would be created using the command (1) and whatever image is used by (2). The assumption (limitation) being that your hook code is available in your deployed images. If we had the ability to mount a script as a volume in a pod, we could support running the provided script as the entrypoint against the specified container image.
>
> Three more scenarios came up:
>
> 1. Running a command not provided by the deployment image
>     1. Would require a third input: an image reference (to use instead of a container name reference)
> 2. Running more than one hook per phase
>     1. Would require handling an array of hooks, which introduces more complexity (especially around failure modes: how to handle partial failures, etc.)
> 3. A combination of 1 and 2: running multiple hooks per phase, where the command for some is provided by the deployment images, and the command for others is provided by some other image(s).
>
> We also discussed providing a snippet of bash to run inside the deployment process pod. This scenario assumes we have the mounting capability (or some other way to get the script into the container), and also has some problems:
>
> 1. The context for your script is going to be dependent on the strategy in use (e.g. our deployer image), which seems unlikely to have the necessary code/userspace for the hook
> 2. The hooks would be constrained to the resource allocations for the deployer pod, which are likely to be extremely minimal
>
> Thoughts?

> **Reviewer:** I think the extension to allow you to point to an arbitrary image to run the command in is probably necessary. It's pretty easy to imagine someone using our mysql image and wanting to use that image to run some mysql client commands, not wanting to embed all their commands in a string: now they need a script, but they have no way to get that script into our mysql image. (They'll have to create their own image, perhaps layered on top of ours.)

> **Reviewer:** Replying inline to the points above:
>
> On the named container reference: defaults to the first container.
>
> On mounting a script as a volume: we keep talking about "mounting" but I don't think that's useful. A snippet should be provided directly with a DC as part of the process definition, or you make it part of a custom deployment.
>
> On running a command not provided by the deployment image: use a custom deployment process or put the command in your image.
>
> On running more than one hook per phase: you can make the hook call other hooks.
>
> On the combination of the two: yuck.
>
> On the deployer image lacking the necessary code/userspace: the necessary code/userspace is `osc` and curl.
>
> On hooks being constrained to the deployer pod's resource allocations: again, `osc`, curl, a few others.



##### Hook failure handling

Hooks designated as *mandatory* should impact the outcome of the deployment.

There are a few possible ways to handle a failed mandatory pre-deployment hook:

1. Transition the deployment to a failed status and do not execute the strategy.
2. Delete and retry the deployment.
    1. Potentially safe because the strategy has not yet executed and the existing prior deployment is still active.
    2. Could be unsafe depending on what the hook did prior to failing.
    3. Further API considerations need to be made to prevent endless deployment attempts.
3. Roll back the deployment to a previous version.
    1. This is very dangerous to do automatically and is probably not realistic at this time:
        1. Requires automated rollback viability analysis/trust.
        2. Requires logic to prevent a chronically failing hook that spans historical deployments from causing unending rollbacks.

This proposal prescribes the use of option 1 as being the simplest starting point for the hook API.

Failed mandatory post-deployment hooks are more challenging:

1. The deployment has most likely already been rolled out and made live by the strategy.
2. Deleting the deployment is no longer safe due to 1.
3. Rollback is necessary and subject to the same challenges presented above.

Due to the complexities of automated rollback, this proposal limits the scope of failure handling for post-deployment hooks: post-deployment hooks cannot be considered mandatory at this time. This limitation may be lifted in the future by a separate proposal.

##### Hook failure reporting

When a deployment hook fails:

1. An error is logged via the global error handler.
2. The hook status is available as an annotation on the deployment.

More reporting capabilities could be addressed in a future proposal.


### Deployment hooks API

The `DeploymentStrategy` gains a new `Lifecycle` field:

```go
type DeploymentStrategy struct {
	// Type is the name of a deployment strategy.
	Type DeploymentStrategyType `json:"type,omitempty"`
	// CustomParams are the input to the Custom deployment strategy.
	CustomParams *CustomDeploymentStrategyParams `json:"customParams,omitempty"`
	// Lifecycle provides optional hooks into the deployment process.
	Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
}
```

```go
// Lifecycle describes actions the system should take in response to
// deployment lifecycle events. The deployment process blocks while
// executing lifecycle handlers. A HandlerFailurePolicy determines what
// action is taken in response to a failed handler.
type Lifecycle struct {
	// Pre is called immediately before the deployment strategy executes.
	Pre *Handler `json:"pre,omitempty"`
	// Post is called immediately after the deployment strategy executes.
	// NOTE: AbortHandlerFailurePolicy is not supported for Post.
	Post *Handler `json:"post,omitempty"`
}
```

Each lifecycle hook is implemented with a `Handler`:

```go
// Handler defines a specific deployment lifecycle action.
type Handler struct {
	// ExecNewPod specifies the action to take.
	ExecNewPod *ExecNewPodAction `json:"execNewPod,omitempty"`
	// FailurePolicy specifies what action to take if the handler fails.
	FailurePolicy HandlerFailurePolicy `json:"failurePolicy"`
}
```

The first handler implementation is pod-based:

```go
// ExecNewPodAction runs a command in a new pod based on the specified
// container which is assumed to be part of the deployment template.
type ExecNewPodAction struct {
	// Command is the action command and its arguments.
	Command []string `json:"command"`
	// Env is a set of environment variables to supply to the action's
	// container. (Added per review: hooks need to be able to set and
	// override environment.)
	Env []EnvVar `json:"env,omitempty"`
	// ContainerName is the name of a container in the deployment pod
	// template whose Docker image will be used for the action's container.
	ContainerName string `json:"containerName"`
}
```

Handler failure management is policy driven:

```go
// HandlerFailurePolicy describes the action to take if a handler fails.
type HandlerFailurePolicy string

const (
	// RetryHandlerFailurePolicy means retry the handler until it succeeds.
	RetryHandlerFailurePolicy HandlerFailurePolicy = "Retry"
	// AbortHandlerFailurePolicy means abort the deployment (if possible).
	AbortHandlerFailurePolicy HandlerFailurePolicy = "Abort"
	// ContinueHandlerFailurePolicy means continue the deployment.
	ContinueHandlerFailurePolicy HandlerFailurePolicy = "Continue"
)
```
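
Putting the types together, a serialized strategy carrying both hook types might look like the following sketch. The values are illustrative (the `post` command here is a hypothetical cache-warming task), and `Abort` is only valid for the `pre` hook:

```json
"strategy": {
  "type": "Recreate",
  "lifecycle": {
    "pre": {
      "execNewPod": {
        "containerName": "rails",
        "command": ["rake", "db:migrate"]
      },
      "failurePolicy": "Abort"
    },
    "post": {
      "execNewPod": {
        "containerName": "rails",
        "command": ["rake", "cache:warm"]
      },
      "failurePolicy": "Continue"
    }
  }
}
```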

`ExecNewPodAction` pods will be associated with deployments using new annotations:

```go
const (
	// PreExecNewPodActionPodAnnotation is the name of a pre-deployment
	// ExecNewPodAction pod.
	PreExecNewPodActionPodAnnotation = "openshift.io/deployment.lifecycle.pre.execnewpod.pod"
	// PreExecNewPodActionPodPhaseAnnotation is the phase of a pre-deployment
	// ExecNewPodAction pod and is used to track its status and outcome.
	PreExecNewPodActionPodPhaseAnnotation = "openshift.io/deployment.lifecycle.pre.execnewpod.phase"
	// PostExecNewPodActionPodAnnotation is the name of a post-deployment
	// ExecNewPodAction pod.
	PostExecNewPodActionPodAnnotation = "openshift.io/deployment.lifecycle.post.execnewpod.pod"
	// PostExecNewPodActionPodPhaseAnnotation is the phase of a post-deployment
	// ExecNewPodAction pod and is used to track its status and outcome.
	PostExecNewPodActionPodPhaseAnnotation = "openshift.io/deployment.lifecycle.post.execnewpod.phase"
)
```
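
As an illustrative sketch, a deployment whose pre-deployment hook has completed might carry annotations like the following (the pod name and phase value are hypothetical):

```json
"annotations": {
  "openshift.io/deployment.lifecycle.pre.execnewpod.pod": "rails-2-hook-pre",
  "openshift.io/deployment.lifecycle.pre.execnewpod.phase": "Succeeded"
}
```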


### Example: Rails migration

Here's an example deployment which demonstrates how to apply deployment hooks to a Rails application which uses migrations.

The application image `example/rails` is built with a `Dockerfile` based on the `rails` image from Docker Hub:

```dockerfile
FROM rails:onbuild
```

A database is exposed to the application using a service:

```json
{
  "kind": "Service",
  "apiVersion": "v1beta1",
  "id": "mysql",
  "containerPort": 3306,
  "port": 5434,
  "selector": {
    "name": "mysql"
  }
}
```

A deployment configuration describes the template for application deployments:

```json
{
  "kind": "DeploymentConfig",
  "apiVersion": "v1beta1",
  "metadata": {
    "name": "rails",
    "description": "A sample Rails application."
  },
  "triggers": [
    {
      "type": "ConfigChange"
    }
  ],
  "template": {
    "strategy": {
      "type": "Recreate",
      "lifecycle": {
        "pre": {
          "execNewPod": {
            "containerName": "rails",
            "command": ["rake", "db:migrate"],
            "env": [
              {
                "name": "CUSTOM_VAR",
                "value": "custom_value"
              }
            ]
          },
          "failurePolicy": "Retry"
        }
      }
    },
    "controllerTemplate": {
      "replicas": 1,
      "replicaSelector": {
        "name": "rails"
      },
      "podTemplate": {
        "desiredState": {
          "manifest": {
            "version": "v1beta1",
            "containers": [
              {
                "name": "rails",
                "image": "example/rails",
                "ports": [
                  {
                    "containerPort": 8080
                  }
                ]
              }
            ]
          }
        },
        "labels": {
          "name": "rails"
        }
      }
    }
  }
}
```

Let's consider a hypothetical timeline of events for this deployment, assuming that the initial version of the application is already deployed as `rails-1`.

1. A new version of the `example/rails` image triggers a deployment of the `rails` deployment configuration.
2. A new deployment `rails-2` is created with 0 replicas; the deployment is not yet live.
3. The `pre` hook command `rake db:migrate` is executed in a container using the `example/rails` image, as specified by the `rails` container.
    1. The `rake` command connects to the database using environment variables provided for the `mysql` service.
4. When `rake db:migrate` finishes successfully, the `Recreate` strategy executes, causing the `rails-2` deployment to become live and `rails-1` to be disabled.
    1. Because `failurePolicy` is set to `Retry`, if the `rake` command fails, it will be retried, and the deployment will not proceed until the command succeeds.
5. Since there is no `post` hook, the deployment is now complete.