Skip to content

Service initial scale implementation for initialScale 0 use case#8613

Merged
knative-prow-robot merged 12 commits intoknative:masterfrom
taragu:tara-init-scale-0-initial-scale-0
Jul 28, 2020
Merged

Service initial scale implementation for initialScale 0 use case#8613
knative-prow-robot merged 12 commits intoknative:masterfrom
taragu:tara-init-scale-0-initial-scale-0

Conversation

@taragu
Copy link
Contributor

@taragu taragu commented Jul 13, 2020

/lint

Part 8 of #7682

Proposed Changes

Implementation of initial scale == 0 use case.

Release Note


/cc @julz @vagababov @markusthoemmes

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 13, 2020
@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers area/autoscale labels Jul 13, 2020
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taragu: 0 warnings.

Details

In response to this:

/lint

Part 8 of #7682

Proposed Changes

Implementation of initial scale == 0 use case.

Release Note


/cc @julz @vagababov @markusthoemmes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Jul 13, 2020
// Manually set the desired scale to 0 if initial scale is 0, because otherwise
// in computeActiveCondition, upon initial deploy we will miss the pc.want == 0
// case and PA will go from inactive to activating.
// Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it means this happens on every scale-from-zero, not just the first? The only thing I can see here that stops this happening on every scale-from-zero is want = -1, but I think that could also be the case on an autoscaler restart (where we also have no metrics, but where I think we wouldnt want the InitialScale behaviour?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tested this locally and we don't get want = -1 once we've scaled up from zero (after we scale down, PA's desired scale is 0), so this will only happen for the first scale-from-zero

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about after autoscaler is restarted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you need to restart the autoscaler to get -1 again

// Manually set the desired scale to 0 if initial scale is 0, because otherwise
// in computeActiveCondition, upon initial deploy we will miss the pc.want == 0
// case and PA will go from inactive to activating.
// Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you need to restart the autoscaler to get -1 again

@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 15, 2020
if err != nil {
return fmt.Errorf("error checking pods for revision %s: %w", pa.Labels[serving.RevisionLabelKey], err)
}
// Manually set the desired scale to 0 if initial scale is 0, because otherwise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this maybe live inside scaler.go? I can see the SKS check is kind of dependent on this, but it feels more natural there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we can move this to scaler.go

if pc.want > 0 || !pa.Status.IsInactive() {
pa.Status.MarkScaleTargetInitialized()
// SKS should already be active.
// In the initialScale > 0 case, SKS should already be active. However, when initialScale
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: if pc.want == 0 which is what happens when we set IS=0.

Not sure that SKS is important here in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to say even after SKS has come up, we can still get pc.want = -1 here because we only override -1 with 0 once, before we MarkScaleTargetInitialized, but pc.want can remain -1 for a few seconds

Comment on lines +270 to +267
} else {
// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
// does not set any condition during reconciliation of a new generation
pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we end up here? Since your change should only pass through pc.want == 0 case?

Copy link
Contributor Author

@taragu taragu Jul 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do hit this after minReady switches from 0 to 1. Even though PA has been marked inactive before in the pc.want==0 path, we'll get the NewObservedGenFailure if this is not set again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly, pc.want is 0 for only one iteration. As soon as we mark ScaleTargetInitialized, it becomes -1 again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.

@taragu taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 6519ebd to 272c331 Compare July 16, 2020 16:37
@taragu
Copy link
Contributor Author

taragu commented Jul 17, 2020

@vagababov would you please take another look?

Comment on lines +270 to +267
} else {
// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
// does not set any condition during reconciliation of a new generation
pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?

@taragu
Copy link
Contributor Author

taragu commented Jul 20, 2020

/retest

1 similar comment
@taragu
Copy link
Contributor Author

taragu commented Jul 20, 2020

/retest

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have comment only issues now. I think every change in this PR is breaking some assumptions so it needs detailed and clear comments throughout.

Comment on lines +270 to +267
} else {
// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
// does not set any condition during reconciliation of a new generation
pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.

pa.Status.MarkActive()
// If initial scale is 0, we don't want to mark scale target initialized when
// want is still -1, because we will override -1 with 0 after SKS has been setup.
if minReady > 0 || (minReady == 0 && pc.want != -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to mention that pc.want is going to be -1 after first iteration when initialScale=0. Otherwise it's not clear how we end up in this situation.

@taragu taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 7330eab to 127961d Compare July 20, 2020 17:32
Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2020
@taragu
Copy link
Contributor Author

taragu commented Jul 21, 2020

/assign @markusthoemmes

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work as expected but I couldn't help but feel that the overriding logic is getting super complex (i.e. us overriding certain parts in certain situations).

I've played around a little and came up with this potential simplification: 1bb4361. I have no clue if this isn't just hot garbage! You've got a much better view at the different states than I do currently. Does this make sense at all?

@taragu
Copy link
Contributor Author

taragu commented Jul 22, 2020

@markusthoemmes I tried out the diff but I'm getting the NewObservedGenFailure status when I tested this locally with deploying a ksvc with 0 initial scale. Line 261-265 (

pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")
) is still needed because we get there when minReady = 1 and pc.want = -1. Other than that this does simplify things a whole lot. Thank you so much for the suggestion!

@knative-prow-robot knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@vagababov
Copy link
Contributor

@taragu
What will happen if SKS will fail to initialize (sks.IsReady() == false) and initialscale = 0, would we make PA ready?

@taragu taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from ad01d7a to 6d826af Compare July 27, 2020 13:36
@knative-test-reporter-robot

The following jobs failed:

Test name Triggers Retries
pull-knative-serving-unit-tests 0/3

Failed non-flaky tests preventing automatic retry of pull-knative-serving-unit-tests:

pkg/reconciler/autoscaling/kpa.[build failed]

@taragu taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 6d826af to 54da489 Compare July 27, 2020 14:07
@taragu
Copy link
Contributor Author

taragu commented Jul 27, 2020

I've updated the PR with the new SKS ready status. When initial scale is 0, SKS won't be ready because the private service endpoints isn't populated. So for the condition of marking ScaleTargetInitialized, I'm using

if pc.ready >= minReady && (pa.Status.IsSKSReady() ||
// In the initial scale 0 case, there won't be any endpoints ready, and therefore SKS will still be not ready.
(!pa.Status.IsSKSReady() && pa.Status.GetCondition(pav1alpha1.PodAutoscalerSKSReady).Message != noPrivateServiceName)) {
pa.Status.MarkScaleTargetInitialized()
}

@julz brought up the idea that we could discard any conditions for marking ScaleTargetInitialized, because we already have a separate status to indicate when SKS isn't ready and its reason, so it would be okay if we mark ScaleTargetInitialized prior to SKS becoming ready. @vagababov @markusthoemmes WDYT?

@markusthoemmes
Copy link
Contributor

The point is I believe that we want the SKS to be ready because that's a prerequisite of the service being able to scale up and receive requests 🤔

// | -1 | >= min | >0 | active | active |
func computeActiveCondition(ctx context.Context, pa *pav1alpha1.PodAutoscaler, pc podCounts) {
minReady := activeThreshold(ctx, pa)
if pc.ready >= minReady && (pa.Status.IsSKSReady() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check IsSKSReady at all?
Since this no longer gates the readiness of the PA overall, we can set it (if we have enough ready pods — we've achieved target scale).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we have a Reason to compare to instead of a Message? The latter makes me nervous as Messages may change over time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be removed now altogether, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, we got rid of this.

switch {
case pc.want == 0:
if pa.Status.IsActivating() {
case pc.want == 0 || minReady == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do || minReady==0? It seems that pc.want=0 is a superset of that? E.g. if minReady=0, pc.want will be 0 (the opposite is not true, though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't overriding -1 with 0, so we are still hitting the pc.want = -1, minReady = 0 case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I had that state somewhere in my head, but I guess we got rid of that :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah originally we were overriding -1 with 0, but with Markus' simplification we are able to get rid of it.

Comment on lines +259 to +267
} else {
// This is for the initialScale 0 case. In the first iteration, minReady is 0,
// but for the following iterations, minReady is 1. pc.want will continue being
// -1 until we start receiving metrics, so we will end up here.
// Even though PA is already been marked as inactive in the first iteration, we
// still need to set it again. Otherwise reconciliation will fail with NewObservedGenFailure
// because we cannot go through one iteration of reconciliation without setting
// some status.
pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I read it, this is redundant.
On first iteration we'll enter the first switch case and mark PA as Inactive.
On the next ones we'll enter here (0 < 1). But as you mentioned pc.want=-1 and the pa.Status==Inactive: thus the if above will always evaluate to false (unless we receive requests and positive metrics and pc.want becomes positive). Thus you're just marking Inactive with Inactive again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we are marking Inactive with Inactive again here. This is still needed because we will be getting the NewObservedGenFailure during reconciliation post processing, because we cannot go through one iteration of reconciliation without setting some sort of status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so just updating the status with the same value updates ObsGen?
@whaught, Weston, if change in inputs didn't yield any change in status, why would this be an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whaught i'm not sure if my understanding is correct. I think it's because PA has been updated during the reconciliation, therefore there's a difference in ObsGen and Gen, but the status is not updated, which causes the failure: https://github.com/knative/pkg/blob/deb6b33d2a6c114f596f52630e85c475bf43abce/reconciler/reconcile_common.go#L50-L53

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Spec changes, we reset Ready to unknown with a dummy message that the reconciler didn't set anything for a new spec before calling ReconcileKind. The reconciler is expected to set something upon observation of a new generation (or is left with the default message as a warning)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder the spec change in PA is 🤔
But anyway, it's interesting, since this does not change ready (before and after would be unknown).

@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/autoscaling/kpa/kpa.go 94.8% 94.9% 0.1
pkg/reconciler/autoscaling/kpa/scaler.go 94.2% 93.4% -0.8

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, beyond the strange requirement to call an idempotent status — which is probably outside of scope of this PR anyway — this looks good.
For some reason I thought we were setting want=0, if initialscale=0 && targetScaleNotReached (probably would simplify things, bu we can iterate), but I guess it was an intermediate step.

/lgtm

I'll let Markus approve, in case I am missing something.
Thanks Tara for the patience:)

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020
Copy link
Contributor

@julz julz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm too - thanks a lot @taragu for all the work on what turned out to be an absolute epic of a feature track!

}

// IsSKSReady returns true if the PA condition denoting that SKS is ready.
func (pas *PodAutoscalerStatus) IsSKSReady() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this isn't actually needed after latest refactor, but I guess it's a reasonable method to have anyway 🤷🏼

Copy link
Contributor

@markusthoemmes markusthoemmes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thanks for the patience! Let's land this! 🎉

@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markusthoemmes, taragu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2020
@julz
Copy link
Contributor

julz commented Jul 28, 2020

🎉🎉🎉🎉

@knative-prow-robot knative-prow-robot merged commit e9ae313 into knative:master Jul 28, 2020
@taragu taragu deleted the tara-init-scale-0-initial-scale-0 branch July 29, 2020 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale area/test-and-release It flags unit/e2e/conformance/perf test issues for product features cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants