New annotations for safety #241

Closed
dturn opened this issue Jan 24, 2018 · 7 comments


@dturn
Contributor

dturn commented Jan 24, 2018

Motivation

We recently had a deploy that pruned a k8s web server Deployment. This was surprisingly easy to do with ERB. When we reverted the PR and re-deployed, the resource was re-created, but because we didn't hard-code the replica count, it came back with only 1 replica. Here are two proposed features that would add some safety checks.

Features

  • An annotation (`kubernetes-deploy.shopify.io/prune: 'never'`) added to a template to prevent pruning, failing the deploy if the resource would have been pruned.
  • An annotation (`kubernetes-deploy.shopify.io/min-replicas: '10'`) to ensure a minimum number of replicas are running.

@KnVerey

@KnVerey
Contributor

KnVerey commented Jan 24, 2018

An annotation (`kubernetes-deploy.shopify.io/prune: 'never'`) added to a template to prevent pruning, failing the deploy if the resource would have been pruned.

How would this work? Pruning happens when the template is not in the set, so we aren't going to see an annotation on something we mistakenly don't have. We're also just using the prune feature built into apply, so we currently don't have any idea what's going to be pruned ahead of time (though maybe the --dry-run output would say, if we ran that first? I don't know.).
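If the dry-run output does turn out to list what would be pruned (which the comment above only speculates about), a pre-check could look roughly like the sketch below; the helper name, the `--all` selector choice, and the output parsing are all assumptions, not existing kubernetes-deploy behaviour:

```ruby
# Hypothetical pre-check: run `kubectl apply` with --dry-run and --prune first,
# and fail if anything that would be pruned carries the proposed annotation.
# ASSUMPTION: the dry-run output marks those resources with the word "pruned".
require "open3"

def resources_pruned_by_dry_run(template_dir, namespace)
  out, status = Open3.capture2(
    "kubectl", "apply", "--dry-run", "--prune", "--all",
    "-f", template_dir, "--namespace", namespace
  )
  raise "kubectl dry run failed" unless status.success?
  out.lines.select { |line| line.include?("pruned") }.map { |line| line.split.first }
end

# The deploy could then abort before the real apply if any of these names map to
# a resource annotated with kubernetes-deploy.shopify.io/prune: 'never'.
```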

Perhaps that could be made a feature of k8s-template-validator instead... CI time is probably better than deploy time for catching things like that.

An annotation (`kubernetes-deploy.shopify.io/min-replicas: '10'`) to ensure a minimum number of replicas are running.

Would this be looked at during validation or verification? In other words, are you thinking we should:

  • try to predict whether the template we have + the current state in the cluster will result in fewer replicas than specified, and modify the desired state we send to the server in the first place
  • fail the deploy if spec.replicas ends up less than the number specified
  • issue a scaling command if spec.replicas ends up less than the number specified
  • something else?

@dturn
Contributor Author

dturn commented Jan 25, 2018

How would this work? Pruning happens when the template is not in the set, so we aren't going to see an annotation on something we mistakenly don't have. We're also just using the prune feature built into apply, so we currently don't have any idea what's going to be pruned ahead of time (though maybe the --dry-run output would say, if we ran that first? I don't know.).

Perhaps that could be made a feature of k8s-template-validator instead... CI time is probably better than deploy time for catching things like that.

I was thinking this would have to be done at run-time since we don't let CI talk to the cluster.
This seems like it might be related to #235, in that we could use a selector to prevent pruning resources with the annotation. (Though it's not clear to me whether the -l flag to kubectl apply only applies to pruning or not.) Using the output from --dry-run is an interesting idea; I hadn't considered it.

Would this be looked at during validation or verification? In other words, are you thinking we should:

  • try to predict whether the template we have + the current state in the cluster will result in fewer replicas than specified, and modify the desired state we send to the server in the first place
  • fail the deploy if spec.replicas ends up less than the number specified
  • issue a scaling command if spec.replicas ends up less than the number specified
  • something else?

I was thinking this would add code that verifies at least n replicas are desired and, if not, scales the deployment up to that amount. I'd lean towards putting this in the deploy phase, but could see it being a new step.
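A rough sketch of what that verification step could look like, assuming a post-deploy hook that shells out to kubectl; the method name and annotation handling are made up here, and only the `kubectl get` / `kubectl scale` commands are real:

```ruby
# Hypothetical post-deploy check: if a Deployment carries the proposed
# kubernetes-deploy.shopify.io/min-replicas annotation and its desired replica
# count ended up below that value, scale it back up to the annotated minimum.
require "json"
require "open3"

MIN_REPLICAS_ANNOTATION = "kubernetes-deploy.shopify.io/min-replicas"

def enforce_min_replicas(deployment_name, namespace)
  raw, status = Open3.capture2("kubectl", "get", "deployment", deployment_name,
    "--namespace", namespace, "-o", "json")
  raise "failed to fetch #{deployment_name}" unless status.success?

  deployment = JSON.parse(raw)
  min = deployment.dig("metadata", "annotations", MIN_REPLICAS_ANNOTATION)&.to_i
  return if min.nil? || min.zero?

  desired = deployment.dig("spec", "replicas").to_i
  return if desired >= min

  # The desired count fell below the annotated minimum, so scale it back up.
  system("kubectl", "scale", "deployment", deployment_name,
    "--replicas=#{min}", "--namespace", namespace) or raise "scale failed"
end
```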

@KnVerey
Contributor

KnVerey commented Jan 25, 2018

I was thinking this would have to be done at run-time since we don't let CI talk to the cluster.

It doesn't really need to talk to the cluster though, does it? Fundamentally, the feature boils down to "make sure this template is in the set", which is perfectly doable locally. Honestly I think this is something that should be done ahead of time using an external list, and kubernetes-deploy should be able to assume that you actually want to deploy what you've given it.
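A minimal sketch of that kind of local check, assuming an external list of protected resources; the file name, list format, and helper are all hypothetical:

```ruby
# Hypothetical CI-time check: every resource named in a "protected" list must
# appear in the rendered template set, so an ERB change can't silently drop it.
require "yaml"

def assert_protected_resources_present!(rendered_templates_dir, protected_list_path)
  protected_resources = YAML.load_file(protected_list_path) # e.g. ["Deployment/web"]

  present = Dir.glob(File.join(rendered_templates_dir, "*.{yml,yaml}")).flat_map do |path|
    YAML.load_stream(File.read(path)).compact.map do |doc|
      "#{doc['kind']}/#{doc.dig('metadata', 'name')}"
    end
  end

  missing = protected_resources - present
  raise "Protected resources missing from template set: #{missing.join(', ')}" if missing.any?
end
```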

it's not clear to me whether the -l flag to kubectl apply only applies to pruning or not

It applies to everything. The PR in question basically lets you operate the same as you usually would, but at a sub-namespace level. It's pretty cool, but I don't think it's relevant here.

I was thinking this would add code that verifies at least n replicas are desired and, if not, scales the deployment up to that amount. I'd lean towards putting this in the deploy phase, but could see it being a new step.

Safety features feel like a no-brainer, but I'm still hesitating about this for some reason. One thing that feels a bit off is that it introduces a new kind of responsibility to the gem, basically a naive metric-less HPA, as you pointed out. Another is that there are already three ways I can think of to manage replicas:
A - Hardcode the count in your spec. The new annotation would be incompatible/redundant with this strategy.
B - Manage them with an HPA. The new annotation could fight with this strategy, e.g. if your HPA says "scale between 5-10 replicas" and your annotation says "gimme at least 11" (same problem you'd have by trying to do both A and B); a sketch of that kind of conflict check follows the list.
C - Manage them manually. I take it that is what we are doing for the deployment that caused trouble. Is that because we adjust them too frequently for (A) and aren't able to do a custom-metrics-based (B) just yet? Would the minimum safe number of replicas not change just as often as you manually scale? If the bad deploy had dropped you to 1 replica, then scaled you back to 100 or whatever 1-5 minutes later, don't you still need alerting telling you to go manually put it back to the actual number it should have been?
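To illustrate the kind of conflict described in (B), here is a hedged sketch of a validation that flags an annotated minimum above an HPA's maxReplicas; the annotation name comes from the proposal, everything else is hypothetical:

```ruby
# Hypothetical validation: warn when the proposed min-replicas annotation on a
# Deployment exceeds the maxReplicas of an HPA that targets that same Deployment.
def min_replicas_conflicts_with_hpa?(deployment, hpa)
  min = deployment.dig("metadata", "annotations",
    "kubernetes-deploy.shopify.io/min-replicas")&.to_i
  return false if min.nil?

  targets_deployment =
    hpa.dig("spec", "scaleTargetRef", "kind") == "Deployment" &&
    hpa.dig("spec", "scaleTargetRef", "name") == deployment.dig("metadata", "name")

  targets_deployment && min > hpa.dig("spec", "maxReplicas").to_i
end
```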

cc @klautcomputing @stefanmb any opinions on these features?

@dturn
Contributor Author

dturn commented Jan 26, 2018

After giving it more thought, I think you're right that `kubernetes-deploy.shopify.io/prune: 'never'` is better checked in CI.

However, I would still like to hear what others think about `min-replicas`.

@stefanmb
Contributor

I think I'm missing some context:

we didn't hard-code the replica count, it came back with only 1 replica

Why doesn't hardcoding the replica count solve the issue?

I'm not a fan of coercing user-requested values into sane defaults, because I think it masks other underlying issues.

@dturn
Contributor Author

dturn commented Jan 30, 2018

Why doesn't hardcoding the replica count solve the issue?

We could hard-code the replicas into the templates, but it comes with its own downside. The biggest would be that it takes a deploy to scale, and that's something on the order of tens of minutes.

Manage them manually. I take it that is what we are doing for the deployment that caused trouble. Is that because we adjust them too frequently for (A) and aren't able to do a custom-metrics-based (B) just yet? Would the minimum safe number of replicas not change just as often as you manually scale? If the bad deploy had dropped you to 1 replica, then scaled you back to 100 or whatever 1-5 minutes later, don't you still need alerting telling you to go manually put it back to the actual number it should have been?

They're being managed manually because we're still modifying the count frequently enough that doing it via a deploy would be painful. And I personally think we won't ever be able to use a custom metric for scaling. The minimum safe number would change, but even an out-of-date value would be a better default than 1.

@dturn
Contributor Author

dturn commented Oct 11, 2018

Closing as won't-implement.

dturn closed this as completed Oct 11, 2018