Replicaset awareness #107

Merged
KnVerey merged 1 commit into master from replicaset_awareness on Jul 4, 2017

Conversation

@KnVerey (Contributor) commented Jun 5, 2017

@Shopify/cloudplatform

This PR makes KubernetesResource::Deployment aware of replica sets. This should fix a number of outstanding issues:

  • Pod warnings get shown for pods that are being shut down, in the case where the last deploy was bad and the current one is actually succeeding (very confusing output)
  • The way we try to identify a representative pod for the purpose of outputting related events and logs on failed deploy is sketchy (Rework deploy output #98 (comment)) and could theoretically still select a pod from the wrong RS.
  • Rolling updates currently cannot fail, and will never time out on the basis of their pod states (notes in code)

Fixes #36
Fixes #42 (with small amount of extra code + test)

Problem

We currently make the link between a deployment and its pods by looking for a "name" label. This is super fragile, since we're not using the full selector set, plus the presence of that label is a Shopify convention, not a k8s requirement. Furthermore, even using the full selector set does not guarantee you're looking at the right pods if you've gotten yourself into a really bad state (selector overlap). Ideally we'd connect Deployments to the pods we care about in a robust way; this is much easier to do as of Kubernetes 1.6 (see "Additional context" below).

Solution

This PR uses the full set of label selectors to grab all potentially relevant ReplicaSets. It then uses the new ownerReferences information to select the ReplicaSets the deployment actually owns, and the deployment.kubernetes.io/revision annotation to pick out the current one. That ReplicaSet in turn uses label selectors and ownerReferences to list its pods.
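
For illustration, here is a minimal sketch of that selection logic (the method shape and argument names are simplified assumptions; the real code operates on kubectl output inside KubernetesResource::Deployment):

    # Sketch: deployment_data is the deployment's parsed JSON; all_rs_data is the
    # parsed JSON for every ReplicaSet matching the deployment's label selectors.
    def find_latest_rs(deployment_data, all_rs_data)
      current_revision = deployment_data["metadata"]["annotations"]["deployment.kubernetes.io/revision"]
      all_rs_data.find do |rs|
        owned = rs["metadata"]["ownerReferences"].to_a.any? do |ref|
          ref["uid"] == deployment_data["metadata"]["uid"]
        end
        owned && rs["metadata"]["annotations"]["deployment.kubernetes.io/revision"] == current_revision
      end
    end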

Additional context

@KnVerey force-pushed the replicaset_awareness branch 2 times, most recently from 2fd72e8 to e710be8, on June 5, 2017 at 20:18
end

def deploy_failed?
@pods.present? && @pods.all?(&:deploy_failed?)
Contributor Author:

This (and deploy_timed_out?) will still basically never fail on account of the pods, since the Pod timeout is higher and the failure condition is super narrow (@phase == "Failed"). However, my next task is to make pod failure detection smarter / more aggressive, and at that point this should make a huge difference in how often our users hit the Deployment timeout.

own_events.merge(@representative_pod.fetch_events)
def fetch_logs
return {} unless @latest_rs.present?
@latest_rs.fetch_logs
Contributor Author:

kubectl get logs deployment/web -c nginx (or whatever) is a valid command. However, looking at the source, I think it might favour pods from the old RS under some circumstances, and certainly won't exclude them. https://github.com/kubernetes/kubernetes/blob/dbd1503b6546b0b6048787bad0e63515f3281506/pkg/controller/controller_utils.go#L669-L698

Contributor:

AFAIK there's no kubectl get logs, you probably meant kubectl logs. Though I agree that kubectl logs deployment/web -c web can be ambiguous.

How did it work before for a deployment with 100 replicas (== pods)? Would representative_pod be one of a hundred?

Do we fetch logs for the developers to tell them what went wrong?

Contributor Author:

Yes, I meant kubectl logs. And yes, we're only fetching logs as debug information for these resource types when they fail (unlike unmanaged pods, for which we always dump logs). representative_pod was a hack to be able to introduce the debug output without requiring this PR to be ready in addition to the extensive changes that did go in at the time. It attempted to select a pod from the new replicaSet by looking at pod creationTimestamps.
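
For context, the old heuristic amounted to something like the following (a sketch only; `pods` here is assumed to be an array of parsed pod JSON hashes, which is not necessarily how the old code was structured):

    # Old hack (roughly): assume the newest pod belongs to the new replicaSet.
    # RFC3339 creationTimestamps sort correctly as strings.
    def representative_pod(pods)
      pods.max_by { |pod| pod["metadata"]["creationTimestamp"] }
    end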

own_events.merge(@pods.first.fetch_events)
end

def fetch_logs
Contributor Author:

See the source for how kubectl selects which pod to dump logs from. The gist is that it is trying to select a pod that is more likely to actually have logs.

Contributor:

❤️ for checking it and for the reference to the k8s source, which was not easy to find

@KnVerey requested review from kirs and wfarr on June 5, 2017 at 20:27
@representative_pod ? @representative_pod.fetch_logs : {}
def fetch_events
return {} unless @latest_rs.present?
@latest_rs.fetch_events
Contributor Author:

We could include deployment events in here too, but I don't think they'd generally be useful (typically, they're about "Scaled replicaSet foo"), and there can be a ton of them since fetch_events doesn't filter by time right now (though I'm considering adding that).

Contributor Author:

My other PR adds filtering by time, so we can actually keep these events after all.

"--since-time=#{@deploy_started.to_datetime.rfc3339}",
"--tail=#{LOG_LINE_COUNT}"
)
container_logs["#{id}/#{container_name}"] = out
Contributor:

Given that we're displaying logs for all the containers in the pod, might make sense to display shorter ones first to avoid having a chatty container drown out a less chatty and possibly more relevant one. Visually separating individual container logs would probably be helpful as well.

Contributor Author:

How about this?

[screenshot]

  • each gets its own section, with a heading instead of tags on every line
  • sorted by line count (see the sketch just after this list)
  • identifier format container 'foo' instead of the full resource path (Deployment/web/foo)
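
Roughly, the ordering and sectioning could look like this (a sketch; container_logs is assumed to map container identifiers to arrays of log lines, and helpful_info / LOG_LINE_COUNT are the names used elsewhere in this diff):

    # Print the least chatty containers first, each under its own heading.
    sorted_logs = container_logs.sort_by { |_identifier, log_lines| log_lines.length }
    sorted_logs.each do |identifier, log_lines|
      helpful_info << "  - Logs from container '#{identifier}' (last #{LOG_LINE_COUNT} lines shown):"
      log_lines.each { |line| helpful_info << "      #{line}" }
    end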

Contributor:

❤️ 👍

@camilo left a comment

Bunch of nitpicks, but I like the idea.

.slice("updatedReplicas", "replicas", "availableReplicas", "unavailableReplicas")
@status = @rollout_data.map { |replica_state, num| "#{num} #{replica_state}" }.join(", ")
deployment_data = JSON.parse(raw_json)
latest_rs_data = get_latest_rs(deployment_data)

This violates SLA a bit IMO; why not have something like latest_rs(deployment_data) return an already-formed ReplicaSet?

Contributor:

Or find_latest_rs, since this method actually fetches multiple ReplicaSets and finds the right one.

Contributor Author:

👍

@found = st.success?
@rollout_data = {}
@rollout_data = { "replicas" => 0 }

Do we expect many calls to sync on the same instance? Or should this be set in the constructor?

Contributor Author:

Yes, sync will be called many, many times on the same instance--about once every three seconds from the beginning of the deploy until the time the resource succeeds/fails. The exception to this is resources that were created to model subresources of the thing that was actually in the template set, e.g. managed replica sets or pods. We'll get new instances of those on every sync.

@parent = parent
@logger = logger
@pods = []
@rollout_data = { "replicas" => 0 }

Oh interesting, this one does do the initialization here; there must be a reason I'm not seeing.

Contributor Author:

The reason has to do with the possible origins of these resource types:

  • Deployments always come from a file in your deploy directory and get their observed state via sync.
  • ReplicaSets usually show up via deployments, but it is possible to deploy them separately, so I've supported that here (and added a test).
  • Pods are very commonly created both ways: unmanaged from a file in your deploy directory, or via a deployment/replicaSet.

For deployments, we have it easy: we see the file, create the instance, and then call sync on it repeatedly to reset and update the observed state. For the other two, there is the additional possibility that they are being created when their parent syncs, and that the parent already has the data on their observed state. In that case, we create the instance and then call interpret_json_data with that data; sync is never called. Effectively, interpret_json_data is the common core between the two flows. I'll investigate making it private / clarifying these two flows.

end
end

def interpret_json_data(rs_data)

Does this method need to be public?


I get it now: the data is set on the objects after initialization, as I mentioned in the other comment. Sounds like it could be an optional param to the constructor.

end

def interpret_json_data(rs_data)
pods_data = get_pods(rs_data)

I might be argued out of this one, but same as in the RS class, this should probably just return the objects.

@@ -192,5 +193,10 @@ def pretty_status
def kubectl
@kubectl ||= Kubectl.new(namespace: @namespace, context: @context, logger: @logger, log_failure_by_default: false)
end

def template

Why is this needed, and why is it okay to return nil if the file is not present?

Contributor Author:

I put it here because multiple subclasses in this PR use it:

  • ReplicaSet uses it to grab the container names in the case where it was not derived from a deployment (managed ReplicaSets will not have a template file)
  • Service uses it to find the related deployment so it can be smarter about how many endpoints to expect (notably so that it can expect zero when appropriate)

Your comment makes me think that perhaps I should move the fallback right into this method: if we have a file, load it from the file; if we don't, fetch it from the API.
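
A rough sketch of that fallback (hedged: the kubectl call shape and the `type` helper are assumptions based on other snippets in this diff, not the actual implementation):

    # Sketch: prefer the template file from the deploy directory; if the resource
    # has no file (e.g. a managed ReplicaSet), ask the cluster for its live spec.
    def template
      if @file
        YAML.load_file(@file.path)
      else
        raw_json, _err, st = kubectl.run("get", type, @name, "--output=json")
        st.success? ? JSON.parse(raw_json) : nil
      end
    end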


I like the change; should it be memoized? Or would that be roflscale?

Contributor Author:

Memoizing should be safe... From file, the result obviously won't change. Although resources do change cluster-side, the ones that don't have a file are child resources for which that shouldn't be true. For example, if the initial status check (before the deploy makes any changes) on a deployment resource memoized a template from the cluster, that'd be bad, since that template would change when the deploy proceeds. But the child resource objects are replaced by their parents every time the parent syncs.
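
If memoized, it would presumably just be the same method wrapped in ||=, along these lines (sketch only; load_template_from_file and fetch_template_from_cluster are hypothetical helpers standing in for the two branches described above):

    def template
      # Safe to memoize: file-backed templates don't change, and file-less child
      # resources are rebuilt by their parents on every sync anyway.
      @template ||= @file ? load_template_from_file : fetch_template_from_cluster
    end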

parent: "#{@name.capitalize} deployment",
logger: @logger
)
@latest_rs.interpret_json_data(latest_rs_data)

I even think this should be an optional(?) param to the constructor:

ReplicaSet.new(blah blah... data: json)

Contributor:

That ^ or @latest_rs = ReplicaSet.from_json(...)

Contributor Author:

I remember trying the ReplicaSet.from_json model at one point in this work and running into difficulties... but I don't remember what they were, so I'll try again. :P

@status = nil
@representative_pod = nil
@pods = []
@latest_rs = nil

if @found

I know it's not a change introduced here, but why is there no defined behaviour for when the resource is not found?

Contributor Author:

The behaviour in that case is a reset, which is accomplished by the code above. Basically, we always reset, and then if we find new data, we repopulate using it.

@@ -0,0 +1,27 @@
---
apiVersion: extensions/v1beta1
kind: ReplicaSet
Contributor:

Is there a use case where we would have a manifest for a ReplicaSet? I thought it was created automatically by the Deployment.

Contributor Author:

I haven't encountered a use case, but it is definitely possible to create them separately.

def fetch_logs
@representative_pod ? @representative_pod.fetch_logs : {}
def fetch_events
return {} unless @latest_rs.present?
Contributor:

I suppose that the deployment would always have a replica set link. Should we raise if it doesn't (catching the possible inconsistent state)?

Contributor Author:

No, I don't think we should raise. For one, the purpose of fetch_events is collection of debug information, so raising during that would throw a wrench in our attempt to be helpful. For another, since the RS creation happens asynchronously, it is theoretically possible for the RS not to exist (yet) at a given point.

return unless st.success?

all_rs_data = JSON.parse(raw_json)["items"]
current_revision = deployment_data["metadata"]["annotations"]["deployment.kubernetes.io/revision"]
Contributor:

It may be handy to use deployment_data.dig here (up to you)

Contributor Author:

That's a Ruby 2.3 feature, so I can't (unless we decide to break with the version supported by shipit-engine, or that gem bumped its requirement and I didn't notice)
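
For reference, the two spellings (the chained-bracket form is what this PR uses; dig needs Ruby >= 2.3):

    # Ruby >= 2.3: returns nil if any intermediate key is missing
    current_revision = deployment_data.dig("metadata", "annotations", "deployment.kubernetes.io/revision")

    # Pre-2.3 equivalent used here; raises NoMethodError if an intermediate key is missing
    current_revision = deployment_data["metadata"]["annotations"]["deployment.kubernetes.io/revision"]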

interpret_json_data(pod_data)
else # reset
@status = @phase = nil
@ready = false
@ready = @found = false
Contributor:

Why do we need the @found flag?

Contributor Author:

It's used by exists?... but I'm not sure why I moved it here if that's what you mean. It should already be set, so I'll change it back.

@kirs (Contributor) left a comment

Great work figuring out how we can rely on Kubernetes internals and use ownerReferences.

Most of the things I'd note were already noted by Camilo. I have some minor questions, otherwise LGTM.

On a side note, this is one of the best PRs I've seen in terms of referencing the k8s sources and digging into the actual behaviour ❤️ I appreciate the time you spent reading Go sources.

])
end

def test_can_deploy_deployment_with_zero_replicas
Contributor:

I was going to mention that it's extremely important to make sure this works with multiple long-running revisions / pods that take more time to shut down / custom terminationGracePeriod, but that's already covered by test_long_running_deployment, and if the test passes I don't have much to add.

sorted_logs.each do |identifier, log_lines|
helpful_info << " - Logs from container '#{identifier}' (last #{LOG_LINE_COUNT} lines shown):"
log_lines.each do |line|
helpful_info << " #{line}"
Contributor Author:

Now looks like this:
[screenshot]
[...event noise addressed by other PR...]
[screenshot]

end
display_logs if unmanaged? && deploy_succeeded?
Contributor Author:

Example of this output:

[screenshot]

@KnVerey (Contributor Author) commented Jun 13, 2017

I made some architectural changes to this PR based on your questions and feedback. Now, whether the resource was created from a deploy template or during a parent resource's sync, its observed state is recorded through its own sync method; interpret_json_data is gone. This way, we have a more unified code path for the resources that can be created both ways (i.e. Pods, ReplicaSets). I'm hoping this will be a lot clearer; lmk what you think!

parent: "#{@name.capitalize} deployment",
logger: @logger,
deploy_started: @deploy_started
).tap { |rs| rs.sync(latest_rs_data) }
Contributor Author:

I can also make a method like this if you still think it is better:

    def self.from_json(**args)
      json_data = args.delete(:json_data)
      inst = new(**args)
      inst.sync(json_data)
      inst
    end

I'm not sure it adds much value, since most of the initializer params aren't from the json, but I'm not against it.


I think it's fine as is; what I don't like is tap. I find that method to be kinda-not-very-useful syntactic sugar. WDYT:

rs = RS.new(..)
rs.sync

Contributor Author:

Because we need to return the rs, it'd be three lines instead of one 😉
Though our style guide doesn't mention it, I've heard other Shopifolk hating on tap in the past. I can change it.

@@ -16,7 +16,11 @@ def sync
end

def deploy_succeeded?
@num_endpoints > 0
if related_deployment_replicas
Contributor Author:

The purpose of this change is to be able to handle zero-replica deployments (ones that have been temporarily scaled down for some reason). I'm second-guessing what I've done here though; I think it might be too aggressive. If the deployment has rolled out and the service has some endpoints, it's not really valuable to hold up the deploy, is it?

WDYT about this logic instead:

if exposes_zero_replica_deployment?
  @num_endpoints == 0
else
  @num_endpoints > 0
end


I like the explicitness of exposes_zero_replica_deployment? over the current more implicit implementation

@camilo left a comment

Some more nitpicks, which you can choose to ignore (except the one about the resource knowing about deploy time).


end
deployment_data = JSON.parse(raw_json)
@latest_rs = find_latest_rs(deployment_data)
@rollout_data = { "replicas" => 0 }.merge!(deployment_data["status"]

Dumb nitpick: why prefer the mutating version of merge here? You're using the return value anyway.

Contributor Author:

No good reason. I refactored this section a bunch of times, so maybe there was some intermediate version that didn't use the return value. I'll fix it.


@name = name
@namespace = namespace
@context = context
@file = file
@parent = parent
@deploy_started = deploy_started

This is an interesting semantic change here; why would a KubernetesResource be aware of when a deploy started?

Contributor Author:

It's not actually new. I added it as a parameter to save the parent resource from having to call the setter immediately in these cases. I'll change it back if you think this introduces confusion. See here in the diff for the setter being used instead.

As for why it is needed, it is primarily so that they know when to time out, although we also use it to filter logs and events. It's not the global deploy start time; it is the time that specific resource was apply'd. For resources in your yamls, the instance is created when we first read in the templates, and this attribute is set later, here.


@KnVerey (Contributor Author) commented Jun 28, 2017

@camilo @kirs This has been rebased on the refactor I extracted out of it based on your feedback. What's left should look pretty familiar, but could you please take a final look? 😄

@KnVerey (Contributor Author) commented Jul 4, 2017

@ibawt can you please review this one?

def fetch_events
own_events = super
return own_events unless @pods.present?
own_events.merge(@pods.first.fetch_events)
Contributor:

pods.first: does that mean we'll lose some data or is it the same for the entire thing?

Contributor Author:

My thinking was that the events will be very similar across the entire set of pods. It is possible some pods will be more interesting than others (e.g. the case you saw where a pod landed on a bad node), but at this stage I don't think we have a nice way of figuring out which one. After my next PR, we could deliberately try to find a failed pod (if any) and use its events... 🤔 Did you have any ideas around this?

Contributor:

No ideas, just wanted to make sure I knew the reason.

@ibawt (Contributor) left a comment

code LGTM


latest_rs_data = all_rs_data.find do |rs|
rs["metadata"]["ownerReferences"].any? { |ref| ref["uid"] == deployment_data["metadata"]["uid"] } &&
rs["metadata"]["annotations"]["deployment.kubernetes.io/revision"] == current_revision
Contributor:

Why do we need to check the revision?

Contributor Author:

I'm using ownerReferences to be completely certain that the RS belongs to the deployment, and the revision to select the new one, as opposed to the n other RSs that could exist for this deployment. This seemed like the best available option based on the k8s source and issues I read, but I could be missing something. Are you aware of a better way?

Contributor:

Ah, I somehow thought it was the k8s revision.

def test_can_deploy_deployment_with_zero_replicas
success = deploy_fixtures("hello-cloud", subset: ["configmap-data.yml", "web.yml.erb"]) do |fixtures|
web = fixtures["web.yml.erb"]["Deployment"].first
web["spec"]["replicas"] = 0
Contributor:

A use case of zero replicas that I remember is:

  1. Create a deployment with an unspecified replicas field
  2. The replicas number is scaled dynamically (can sometimes be zero)
  3. At the moment of the deploy it's zero

I see the test works with YAML that explicitly sets replicas to zero, but would it also support the case of a dynamic replicas value set to zero?

Contributor Author:

Yes, that should work fine. The implementation uses selectors to grab a live copy of the related deployment's template, so it should work even if the deployment wasn't managed by kubernetes-deploy.
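
A hedged sketch of what that lookup might look like in the Service class (the kubectl call shape and matching logic here are assumptions based on other snippets in this diff, not the exact implementation):

    # Find a deployment whose pod template carries the labels this service
    # selects on, and read its desired replica count from the live cluster.
    def related_deployment_replicas
      raw_json, _err, st = kubectl.run("get", "deployments", "--output=json")
      return unless st.success?

      selector = template["spec"]["selector"] || {}
      related = JSON.parse(raw_json)["items"].find do |deployment|
        labels = deployment["spec"]["template"]["metadata"]["labels"] || {}
        selector.all? { |key, value| labels[key] == value }
      end
      related && related["spec"]["replicas"]
    end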

Contributor:

👍 great work!

@KnVerey merged commit 52c2bce into master on Jul 4, 2017
@KnVerey deleted the replicaset_awareness branch on July 4, 2017 at 19:39
Successfully merging this pull request may close these issues:

  • Zero-replicas deployments never succeed
  • Deployment rollout monitoring revamp