New annotations for safety #241

Closed
dturn opened this issue Jan 24, 2018 · 7 comments


@dturn
Contributor

dturn commented Jan 24, 2018

Motivation

We recently had a deploy that pruned a k8s web server Deployment. This was surprisingly easy to do with ERB. When we reverted the PR and re-deployed, the resource was re-created, but because we didn't hard-code the replica count, it came back with only 1 replica. Here are two proposed features that would add some safety checks.

Features

  • An annotation (`kubernetes-deploy.shopify.io/prune: 'never'`) added to a template to prevent pruning, failing the deploy if the resource would have been pruned.
  • An annotation (`kubernetes-deploy.shopify.io/min-replicas: '10'`) to ensure a minimum number of replicas are running.

@KnVerey

@KnVerey
Contributor

KnVerey commented Jan 24, 2018

An annotation (`kubernetes-deploy.shopify.io/prune: 'never'`) added to a template to prevent pruning, failing the deploy if the resource would have been pruned.

How would this work? Pruning happens when the template is not in the set, so we aren't going to see an annotation on something we mistakenly don't have. We're also just using the prune feature built into apply, so we currently don't have any idea what's going to be pruned ahead of time (though maybe the --dry-run output would say, if we ran that first? I don't know.).
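If the dry-run output does turn out to list what would be pruned (which the comment above only speculates about), a pre-check could look roughly like the sketch below; the helper name, the `--all` selector choice, and the output parsing are all assumptions, not existing kubernetes-deploy behaviour:

```ruby
# Hypothetical pre-check: run `kubectl apply` with --dry-run and --prune first,
# and fail if anything that would be pruned carries the proposed annotation.
# ASSUMPTION: the dry-run output marks those resources with the word "pruned".
require "open3"

def resources_pruned_by_dry_run(template_dir, namespace)
  out, status = Open3.capture2(
    "kubectl", "apply", "--dry-run", "--prune", "--all",
    "-f", template_dir, "--namespace", namespace
  )
  raise "kubectl dry run failed" unless status.success?
  out.lines.select { |line| line.include?("pruned") }.map { |line| line.split.first }
end

# The deploy could then abort before the real apply if any of these names map to
# a resource annotated with kubernetes-deploy.shopify.io/prune: 'never'.
```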

Perhaps that could be made a feature of k8s-template-validator instead... CI time is probably better than deploy time for catching things like that.

An annotation (`kubernetes-deploy.shopify.io/min-replicas: '10'`) to ensure a minimum number of replicas are running.

Would this be looked at during validation or verification? In other words, are you thinking we should:

  • try to predict whether the template we have + the current state in the cluster will result in fewer replicas than specified, and modify the desired state we send to the server in the first place
  • fail the deploy if spec.replicas ends up less than the number specified
  • issue a scaling command if spec.replicas ends up less than the number specified
  • something else?

@dturn
Contributor Author

dturn commented Jan 25, 2018

How would this work? Pruning happens when the template is not in the set, so we aren't going to see an annotation on something we mistakenly don't have. We're also just using the prune feature built into apply, so we currently don't have any idea what's going to be pruned ahead of time (though maybe the --dry-run output would say, if we ran that first? I don't know.).

Perhaps that could be made a feature of k8s-template-validator instead... CI time is probably better than deploy time for catching things like that.

I was thinking this would have to be done at run-time since we don't let CI talk to the cluster.
This seems like it might be related to #235, in that we could use a selector to prevent pruning resources with the annotation. (Though it's not clear to me whether the -l flag to kubectl apply only applies to pruning or not.) Using the output from --dry-run is an interesting idea; I hadn't considered it.

Would this be looked at during validation or verification? In other words, are you thinking we should:

  • try to predict whether the template we have + the current state in the cluster will result in fewer replicas than specified, and modify the desired state we send to the server in the first place
  • fail the deploy if spec.replicas ends up less than the number specified
  • issue a scaling command if spec.replicas ends up less than the number specified
  • something else?

I was thinking this would add code that verifies at least n replicas are desired and, if not, scales the deployment up to that amount. I'd lean towards putting this in the deploy phase, but could see it being a new step.
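A rough sketch of what that verification step could look like, assuming a post-deploy hook that shells out to kubectl; the method name and annotation handling are made up here, and only the `kubectl get` / `kubectl scale` commands are real:

```ruby
# Hypothetical post-deploy check: if a Deployment carries the proposed
# kubernetes-deploy.shopify.io/min-replicas annotation and its desired replica
# count ended up below that value, scale it back up to the annotated minimum.
require "json"
require "open3"

MIN_REPLICAS_ANNOTATION = "kubernetes-deploy.shopify.io/min-replicas"

def enforce_min_replicas(deployment_name, namespace)
  raw, status = Open3.capture2("kubectl", "get", "deployment", deployment_name,
    "--namespace", namespace, "-o", "json")
  raise "failed to fetch #{deployment_name}" unless status.success?

  deployment = JSON.parse(raw)
  min = deployment.dig("metadata", "annotations", MIN_REPLICAS_ANNOTATION)&.to_i
  return if min.nil? || min.zero?

  desired = deployment.dig("spec", "replicas").to_i
  return if desired >= min

  # The desired count fell below the annotated minimum, so scale it back up.
  system("kubectl", "scale", "deployment", deployment_name,
    "--replicas=#{min}", "--namespace", namespace) or raise "scale failed"
end
```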

@KnVerey
Contributor

KnVerey commented Jan 25, 2018

I was thinking this would have to be done at run-time since we don't let CI talk to the cluster.

It doesn't really need to talk to the cluster though, does it? Fundamentally, the feature boils down to "make sure this template is in the set", which is perfectly doable locally. Honestly I think this is something that should be done ahead of time using an external list, and kubernetes-deploy should be able to assume that you actually want to deploy what you've given it.
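A minimal sketch of that kind of local check, assuming an external list of protected resources; the file name, list format, and helper are all hypothetical:

```ruby
# Hypothetical CI-time check: every resource named in a "protected" list must
# appear in the rendered template set, so an ERB change can't silently drop it.
require "yaml"

def assert_protected_resources_present!(rendered_templates_dir, protected_list_path)
  protected_resources = YAML.load_file(protected_list_path) # e.g. ["Deployment/web"]

  present = Dir.glob(File.join(rendered_templates_dir, "*.{yml,yaml}")).flat_map do |path|
    YAML.load_stream(File.read(path)).compact.map do |doc|
      "#{doc['kind']}/#{doc.dig('metadata', 'name')}"
    end
  end

  missing = protected_resources - present
  raise "Protected resources missing from template set: #{missing.join(', ')}" if missing.any?
end
```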

it's not clear to me whether the -l flag to kubectl apply only applies to pruning or not

It applies to everything. The PR in question basically lets you operate the same as you usually would, but at a sub-namespace level. It's pretty cool, but I don't think it's relevant here.

I was thinking this would add code that verifies at least n replicas are desired and, if not, scales the deployment up to that amount. I'd lean towards putting this in the deploy phase, but could see it being a new step.

Safety features feel like a no-brainer, but I'm still hesitating about this for some reason. One thing that feels a bit off is that it introduces a new kind of responsibility to the gem, basically a naive metric-less HPA, as you pointed out. Another is that there are already three ways I can think of to manage replicas:
A - Hardcode the count in your spec. The new annotation would be incompatible/redundant with this strategy.
B - Manage them with an HPA. The new annotation could fight with this strategy, e.g. if your HPA says "scale between 5-10 replicas" and your annotation says "gimme at least 11" (same problem you'd have by trying to do both A and B); a sketch of that kind of conflict check follows the list.
C - Manage them manually. I take it that is what we are doing for the deployment that caused trouble. Is that because we adjust them too frequently for (A) and aren't able to do a custom-metrics-based (B) just yet? Would the minimum safe number of replicas not change just as often as you manually scale? If the bad deploy had dropped you to 1 replica, then scaled you back to 100 or whatever 1-5 minutes later, don't you still need alerting telling you to go manually put it back to the actual number it should have been?
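To illustrate the kind of conflict described in (B), here is a hedged sketch of a validation that flags an annotated minimum above an HPA's maxReplicas; the annotation name comes from the proposal, everything else is hypothetical:

```ruby
# Hypothetical validation: warn when the proposed min-replicas annotation on a
# Deployment exceeds the maxReplicas of an HPA that targets that same Deployment.
def min_replicas_conflicts_with_hpa?(deployment, hpa)
  min = deployment.dig("metadata", "annotations",
    "kubernetes-deploy.shopify.io/min-replicas")&.to_i
  return false if min.nil?

  targets_deployment =
    hpa.dig("spec", "scaleTargetRef", "kind") == "Deployment" &&
    hpa.dig("spec", "scaleTargetRef", "name") == deployment.dig("metadata", "name")

  targets_deployment && min > hpa.dig("spec", "maxReplicas").to_i
end
```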

cc @klautcomputing @stefanmb any opinions on these features?

@dturn
Contributor Author

dturn commented Jan 26, 2018

After giving it more thought, I think you're right that `kubernetes-deploy.shopify.io/prune: 'never'` is better checked in CI.

However, I would still like to hear what others think about `min-replicas`.

@stefanmb
Contributor

I think I'm missing some context:

we didn't hard-code the replica count, it came back with only 1 replica

Why doesn't hardcoding the replica count solve the issue?

I'm not a fan of coercing user-requested values into sane defaults, because I think it masks other underlying issues.

@dturn
Contributor Author

dturn commented Jan 30, 2018

Why doesn't hardcoding the replica count solve the issue?

We could hard-code the replicas into the templates, but it comes with its own downside. The biggest would be that it takes a deploy to scale, and that's something on the order of tens of minutes.

Manage them manually. I take it that is what we are doing for the deployment that caused trouble. Is that because we adjust them too frequently for (A) and aren't able to do a custom-metrics-based (B) just yet? Would the minimum safe number of replicas not change just as often as you manually scale? If the bad deploy had dropped you to 1 replica, then scaled you back to 100 or whatever 1-5 minutes later, don't you still need alerting telling you to go manually put it back to the actual number it should have been?

They're being managed manually because we're still modifying the count frequently enough that doing it via a deploy would be painful. And I personally think we won't ever be able to use a custom metric for scaling. The minimum safe number would change, but even an out-of-date value would be a better default than 1.

@dturn
Contributor Author

dturn commented Oct 11, 2018

Closing as won't-implement.

dturn closed this as completed Oct 11, 2018