Service initial scale implementation for initialScale 0 use case by taragu · Pull Request #8613 · knative/serving

taragu · 2020-07-13T14:59:07Z

/lint

Part 8 of #7682

Proposed Changes

Implementation of initial scale == 0 use case.

Release Note

/cc @julz @vagababov @markusthoemmes

knative-prow-robot

@taragu: 0 warnings.

pkg/reconciler/autoscaling/kpa/kpa_test.go

julz · 2020-07-13T15:10:44Z

pkg/reconciler/autoscaling/kpa/kpa.go

+	// Manually set the desired scale to 0 if initial scale is 0, because otherwise
+	// in computeActiveCondition, upon initial deploy we will miss the pc.want == 0
+	// case and PA will go from inactive to activating.
+	// Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once


This seems to mean that this happens on every scale-from-zero, not just the first? The only thing I can see here that stops it from happening on every scale-from-zero is want = -1, but I think that could also be the case on an autoscaler restart (where we also have no metrics, but where I think we wouldn't want the InitialScale behaviour?)

I tested this locally and we don't get want = -1 once we've scaled up from zero (after we scale down, the PA's desired scale is 0), so this will only happen for the first scale-from-zero.

What about after the autoscaler is restarted?

Yes, you need to restart the autoscaler to get -1 again

pkg/reconciler/autoscaling/kpa/kpa.go

vagababov · 2020-07-13T16:43:13Z

pkg/reconciler/autoscaling/kpa/kpa.go

+	// Manually set the desired scale to 0 if initial scale is 0, because otherwise
+	// in computeActiveCondition, upon initial deploy we will miss the pc.want == 0
+	// case and PA will go from inactive to activating.
+	// Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once


Yes, you need to restart the autoscaler to get -1 again

vagababov · 2020-07-15T21:05:50Z

pkg/reconciler/autoscaling/kpa/kpa.go

 	if err != nil {
 		return fmt.Errorf("error checking pods for revision %s: %w", pa.Labels[serving.RevisionLabelKey], err)
 	}
+	// Manually set the desired scale to 0 if initial scale is 0, because otherwise


Should this maybe live inside scaler.go? I can see the SKS check is kind of dependent on this, but it feels more natural there.

yes we can move this to scaler.go

vagababov · 2020-07-15T21:10:28Z

pkg/reconciler/autoscaling/kpa/kpa.go

-		if pc.want > 0 || !pa.Status.IsInactive() {
-			pa.Status.MarkScaleTargetInitialized()
-			// SKS should already be active.
+		// In the initialScale > 0 case, SKS should already be active. However, when initialScale


Same here: if pc.want == 0 which is what happens when we set IS=0.

Not sure that SKS is important here in the comment.

I was trying to say even after SKS has come up, we can still get pc.want = -1 here because we only override -1 with 0 once, before we MarkScaleTargetInitialized, but pc.want can remain -1 for a few seconds
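
The "override -1 with 0 only once" behaviour described at this point in the review can be sketched roughly as below. This is a minimal illustration, not code from kpa.go or scaler.go: `effectiveWant`, its parameters, and the `noMetrics` constant are hypothetical names, and the sketch assumes the decider reports -1 whenever it has no metrics. Later in the review this override was dropped again.

```go
package main

import "fmt"

// noMetrics mirrors the "want = -1" value from the discussion: the
// autoscaler has no metrics to base a decision on. Hypothetical name.
const noMetrics = -1

// effectiveWant is a hypothetical helper, not a function from the PR.
// It overrides "no metrics" with 0 only for initial scale 0, and only
// until the scale target has been marked initialized.
func effectiveWant(want, initialScale int, targetInitialized bool) int {
	if want == noMetrics && initialScale == 0 && !targetInitialized {
		return 0
	}
	return want
}

func main() {
	// First reconcile of a revision deployed with initial scale 0.
	fmt.Println(effectiveWant(noMetrics, 0, false))
	// Later reconciles: still no metrics, but the override no longer applies.
	fmt.Println(effectiveWant(noMetrics, 0, true))
	// A real recommendation is never overridden.
	fmt.Println(effectiveWant(3, 0, false))
}
```

Once the target is marked initialized, the helper passes -1 through unchanged, which is why the following reconciles can still observe pc.want = -1 for a while.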

vagababov · 2020-07-15T21:16:26Z

pkg/reconciler/autoscaling/kpa/kpa.go

+		} else {
+			// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
+			// does not set any condition during reconciliation of a new generation
+			pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")


How do we end up here? Since your change should only pass through pc.want == 0 case?

We do hit this after minReady switches from 0 to 1. Even though PA has been marked inactive before in the pc.want==0 path, we'll get the NewObservedGenFailure if this is not set again.

hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?

Yes exactly, pc.want is 0 for only one iteration. As soon as we mark ScaleTargetInitialized, it becomes -1 again.

Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.

taragu · 2020-07-17T16:00:42Z

@vagababov would you please take another look?

vagababov · 2020-07-17T20:45:10Z

pkg/reconciler/autoscaling/kpa/kpa.go

+		} else {
+			// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
+			// does not set any condition during reconciliation of a new generation
+			pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")


hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?

pkg/reconciler/autoscaling/kpa/kpa_test.go

taragu · 2020-07-20T12:58:56Z

/retest

taragu · 2020-07-20T14:38:42Z

/retest

vagababov

I have comment-only issues now. I think every change in this PR breaks some assumptions, so it needs detailed and clear comments throughout.

vagababov · 2020-07-20T16:55:42Z

pkg/reconciler/autoscaling/kpa/kpa.go

+		} else {
+			// Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler
+			// does not set any condition during reconciliation of a new generation
+			pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")


Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.

vagababov · 2020-07-20T17:02:58Z

pkg/reconciler/autoscaling/kpa/kpa.go

-			pa.Status.MarkActive()
+			// If initial scale is 0, we don't want to mark scale target initialized when
+			// want is still -1, because we will override -1 with 0 after SKS has been setup.
+			if minReady > 0 || (minReady == 0 && pc.want != -1) {


I think you want to mention that pc.want is going to be -1 after first iteration when initialScale=0. Otherwise it's not clear how we end up in this situation.

vagababov

/lgtm

taragu · 2020-07-21T10:52:33Z

/assign @markusthoemmes

markusthoemmes

This seems to work as expected but I couldn't help but feel that the overriding logic is getting super complex (i.e. us overriding certain parts in certain situations).

I've played around a little and came up with this potential simplification: 1bb4361. I have no clue if this isn't just hot garbage! You've got a much better view of the different states than I do currently. Does this make sense at all?

taragu · 2020-07-22T11:23:32Z

@markusthoemmes I tried out the diff, but I get the NewObservedGenFailure status when testing locally by deploying a ksvc with initial scale 0. Lines 261-265 of serving/pkg/reconciler/autoscaling/kpa/kpa.go (line 265 in 6c1fa73):

	pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")

are still needed because we get there when minReady = 1 and pc.want = -1. Other than that, this does simplify things a whole lot. Thank you so much for the suggestion!

vagababov · 2020-07-22T22:39:37Z

@taragu
What will happen if SKS fails to initialize (sks.IsReady() == false) and initialScale = 0? Would we make the PA ready?

knative-test-reporter-robot · 2020-07-27T13:41:47Z

The following jobs failed:

Test name	Triggers	Retries
pull-knative-serving-unit-tests		0/3

Failed non-flaky tests preventing automatic retry of pull-knative-serving-unit-tests:

pkg/reconciler/autoscaling/kpa.[build failed]

taragu · 2020-07-27T15:26:14Z

I've updated the PR with the new SKS ready status. When initial scale is 0, SKS won't be ready because the private service endpoints aren't populated. So for the condition of marking ScaleTargetInitialized, I'm using lines 240 to 244 of serving/pkg/reconciler/autoscaling/kpa/kpa.go in 54da489:

	if pc.ready >= minReady && (pa.Status.IsSKSReady() ||
		// In the initial scale 0 case, there won't be any endpoints ready, and therefore SKS will still be not ready.
		(!pa.Status.IsSKSReady() && pa.Status.GetCondition(pav1alpha1.PodAutoscalerSKSReady).Message != noPrivateServiceName)) {
		pa.Status.MarkScaleTargetInitialized()
	}

@julz brought up the idea that we could discard any conditions for marking ScaleTargetInitialized, because we already have a separate status to indicate when SKS isn't ready and its reason, so it would be okay if we mark ScaleTargetInitialized prior to SKS becoming ready. @vagababov @markusthoemmes WDYT?

markusthoemmes · 2020-07-27T15:34:02Z

The point is I believe that we want the SKS to be ready because that's a prerequisite of the service being able to scale up and receive requests 🤔

vagababov · 2020-07-27T17:08:52Z

pkg/reconciler/autoscaling/kpa/kpa.go

+//    | -1   | >= min | >0    | active     | active     |
 func computeActiveCondition(ctx context.Context, pa *pav1alpha1.PodAutoscaler, pc podCounts) {
 	minReady := activeThreshold(ctx, pa)
+	if pc.ready >= minReady && (pa.Status.IsSKSReady() ||


Do we need to check IsSKSReady at all?
Since this no longer gates the readiness of the PA overall, we can set it (if we have enough ready pods — we've achieved target scale).

Any chance we have a Reason to compare to instead of a Message? The latter makes me nervous as Messages may change over time.

this should be removed now altogether, I think.

yep, we got rid of this.
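
The concern about comparing against a Message can be illustrated with a small sketch. Everything here is hypothetical: the `condition` struct is a pared-down stand-in for a status condition, and `reasonNoPrivateService` and `blockedOnPrivateService` are invented names. The check was removed from the PR, so this only shows why matching on a Reason (and guarding against an absent condition) tends to be less brittle.

```go
package main

import "fmt"

// condition is a pared-down stand-in for a status condition.
type condition struct {
	Reason  string // stable, machine-readable identifier
	Message string // human-readable text that may be reworded over time
}

// reasonNoPrivateService is a hypothetical reason constant, for illustration.
const reasonNoPrivateService = "NoPrivateServiceName"

// blockedOnPrivateService reports whether the condition says the private
// service name is missing. It tolerates a nil (absent) condition and
// matches on Reason rather than Message.
func blockedOnPrivateService(c *condition) bool {
	return c != nil && c.Reason == reasonNoPrivateService
}

func main() {
	c := &condition{Reason: reasonNoPrivateService, Message: "No Private Service Name yet"}
	fmt.Println(blockedOnPrivateService(c))

	// Rewording the message does not change the result.
	c.Message = "the private service name has not been assigned"
	fmt.Println(blockedOnPrivateService(c))

	// An absent condition is handled without a nil dereference.
	fmt.Println(blockedOnPrivateService(nil))
}
```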

vagababov · 2020-07-27T17:20:26Z

pkg/reconciler/autoscaling/kpa/kpa.go

 	switch {
-	case pc.want == 0:
-		if pa.Status.IsActivating() {
+	case pc.want == 0 || minReady == 0:


do we need to do || minReady==0? It seems that pc.want=0 is a superset of that? E.g. if minReady=0, pc.want will be 0 (the opposite is not true, though).

We aren't overriding -1 with 0, so we are still hitting the pc.want = -1, minReady = 0 case.

hm, I had that state somewhere in my head, but I guess we got rid of that :-)

yeah originally we were overriding -1 with 0, but with Markus' simplification we are able to get rid of it.
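
To make the pc.want = -1, minReady = 0 case concrete, here is a simplified model of the switch pieced together from the diff fragments quoted in this review. It is a sketch under assumptions, not the merged code: `podCounts` is reduced to two fields, `paState` and `nextState` are illustrative names, and any handling not visible in these fragments is left out.

```go
package main

import "fmt"

// podCounts is a reduced stand-in for the struct used in kpa.go.
type podCounts struct {
	want  int // -1 means "no metrics yet"
	ready int
}

// paState is an illustrative representation of the PA's active condition.
type paState string

const (
	inactive   paState = "inactive"
	activating paState = "activating"
	active     paState = "active"
)

// nextState models the switch under review. cur is the state before this
// reconcile; the return value is the state that would be marked.
func nextState(cur paState, pc podCounts, minReady int) paState {
	switch {
	case pc.want == 0 || minReady == 0:
		return inactive
	case pc.ready < minReady:
		if pc.want > 0 || cur != inactive {
			return activating
		}
		// want is still -1 and the PA is already inactive: mark it again.
		return inactive
	default: // pc.ready >= minReady
		if pc.want > 0 || cur != inactive {
			return active
		}
		return cur
	}
}

func main() {
	s := nextState("", podCounts{want: -1, ready: 0}, 0) // first pass, initial scale 0
	fmt.Println(s)
	s = nextState(s, podCounts{want: -1, ready: 0}, 1) // minReady is now 1, no metrics
	fmt.Println(s)
	s = nextState(s, podCounts{want: 1, ready: 0}, 1) // traffic arrives
	fmt.Println(s)
	s = nextState(s, podCounts{want: 1, ready: 1}, 1) // a pod becomes ready
	fmt.Println(s)
}
```

Without the `minReady == 0` term, the first pass would not match the first case, because pc.want is -1 rather than 0.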

vagababov · 2020-07-27T17:35:04Z

pkg/reconciler/autoscaling/kpa/kpa.go

+		} else {
+			// This is for the initialScale 0 case. In the first iteration, minReady is 0,
+			// but for the following iterations, minReady is 1. pc.want will continue being
+			// -1 until we start receiving metrics, so we will end up here.
+			// Even though PA is already been marked as inactive in the first iteration, we
+			// still need to set it again. Otherwise reconciliation will fail with NewObservedGenFailure
+			// because we cannot go through one iteration of reconciliation without setting
+			// some status.
+			pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.")


The way I read it, this is redundant.
On first iteration we'll enter the first switch case and mark PA as Inactive.
On the next ones we'll enter here (0 < 1). But as you mentioned pc.want=-1 and the pa.Status==Inactive: thus the if above will always evaluate to false (unless we receive requests and positive metrics and pc.want becomes positive). Thus you're just marking Inactive with Inactive again.

Correct, we are marking Inactive with Inactive again here. This is still needed because we will be getting the NewObservedGenFailure during reconciliation post processing, because we cannot go through one iteration of reconciliation without setting some sort of status.

Hm, so just updating the status with the same value updates ObsGen?
@whaught, Weston, if change in inputs didn't yield any change in status, why would this be an error?

@whaught I'm not sure if my understanding is correct. I think it's because the PA has been updated during the reconciliation, so there's a difference between ObsGen and Gen, but the status is not updated, which causes the failure: https://github.com/knative/pkg/blob/deb6b33d2a6c114f596f52630e85c475bf43abce/reconciler/reconcile_common.go#L50-L53

When Spec changes, we reset Ready to unknown with a dummy message that the reconciler didn't set anything for a new spec before calling ReconcileKind. The reconciler is expected to set something upon observation of a new generation (or is left with the default message as a warning)

I wonder what the spec change in PA is 🤔
But anyway, it's interesting, since this does not change ready (before and after would be unknown).
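
The mechanism described in this thread can be modelled in a few lines. This is a deliberately simplified sketch, not the knative/pkg implementation: the `resource` struct and the `preProcess`, `postProcess`, and `reconcile` functions are illustrative names, and a single `readyReason` string stands in for the whole Ready condition.

```go
package main

import "fmt"

// newObservedGenFailure is the placeholder reason named in the discussion.
const newObservedGenFailure = "NewObservedGenFailure"

// resource is a toy model: one generation counter pair plus the reason
// currently recorded on the Ready condition.
type resource struct {
	generation         int64
	observedGeneration int64
	readyReason        string
}

// preProcess resets Ready to the placeholder when a new generation is seen.
func preProcess(r *resource) {
	if r.observedGeneration != r.generation {
		r.readyReason = newObservedGenFailure
	}
}

// postProcess records the observed generation and reports whether the
// placeholder was left in place, meaning the reconciler set no condition.
func postProcess(r *resource) bool {
	r.observedGeneration = r.generation
	return r.readyReason == newObservedGenFailure
}

// reconcile runs one pass with the supplied kind-specific logic.
func reconcile(r *resource, kind func(*resource)) bool {
	preProcess(r)
	kind(r)
	return postProcess(r)
}

func main() {
	noop := func(*resource) {}
	markInactive := func(r *resource) { r.readyReason = "NoTraffic" }

	// The reconciler sets nothing: the placeholder survives.
	r1 := &resource{generation: 2, observedGeneration: 1, readyReason: "NoTraffic"}
	fmt.Println(reconcile(r1, noop), r1.readyReason)

	// The reconciler re-marks the same status: the placeholder is replaced.
	r2 := &resource{generation: 2, observedGeneration: 1, readyReason: "NoTraffic"}
	fmt.Println(reconcile(r2, markInactive), r2.readyReason)
}
```

In this model, re-marking an unchanged status is not a no-op, because the pre-processing step has already overwritten the previous reason by the time the kind-specific logic runs.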

knative-metrics-robot · 2020-07-27T18:41:54Z

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/autoscaling/kpa/kpa.go	94.8%	94.9%	0.1
pkg/reconciler/autoscaling/kpa/scaler.go	94.2%	93.4%	-0.8

vagababov

Alright, beyond the strange requirement to call an idempotent status — which is probably outside of scope of this PR anyway — this looks good.
For some reason I thought we were setting want=0 if initialscale=0 && targetScaleNotReached (that would probably simplify things, but we can iterate), but I guess it was an intermediate step.

/lgtm

I'll let Markus approve, in case I am missing something.
Thanks Tara for the patience:)

julz

lgtm too - thanks a lot @taragu for all the work on what turned out to be an absolute epic of a feature track!

julz · 2020-07-28T06:02:08Z

pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go

 }

+// IsSKSReady returns true if the PA condition denoting that SKS is ready.
+func (pas *PodAutoscalerStatus) IsSKSReady() bool {


I think this isn't actually needed after latest refactor, but I guess it's a reasonable method to have anyway 🤷🏼

markusthoemmes

/lgtm
/approve

Thanks for the patience! Let's land this! 🎉

knative-prow-robot · 2020-07-28T07:28:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markusthoemmes, taragu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/apis/autoscaling/OWNERS~~ [markusthoemmes]
~~pkg/reconciler/autoscaling/OWNERS~~ [markusthoemmes]
~~test/OWNERS~~ [markusthoemmes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

julz · 2020-07-28T07:35:57Z

🎉🎉🎉🎉

knative-prow-robot requested review from julz, markusthoemmes and vagababov July 13, 2020 14:59

googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jul 13, 2020

knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/API API objects and controllers area/autoscale labels Jul 13, 2020

knative-prow-robot reviewed Jul 13, 2020

knative-prow-robot added the area/test-and-release It flags unit/e2e/conformance/perf test issues for product features label Jul 13, 2020

julz reviewed Jul 13, 2020

vagababov reviewed Jul 13, 2020

knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 15, 2020

vagababov reviewed Jul 15, 2020

taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 6519ebd to 272c331 Compare July 16, 2020 16:37

vagababov reviewed Jul 17, 2020

vagababov reviewed Jul 20, 2020

taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 7330eab to 127961d Compare July 20, 2020 17:32

vagababov reviewed Jul 20, 2020

knative-prow-robot assigned vagababov Jul 20, 2020

knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 20, 2020

knative-prow-robot assigned markusthoemmes Jul 21, 2020

markusthoemmes reviewed Jul 22, 2020

knative-prow-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020

taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from ad01d7a to 6d826af Compare July 27, 2020 13:36

Tara Gu added 11 commits July 27, 2020 09:48

Service initial scale implementation for initialScale 0 use case

a10e93e

Fix tests and minReady comparison

b241828

Limit override of want -1 -> 0

16add1e

Fix tests, and move -1 -> 0 into scaler.go

dbdbf17

Fix after rebase

6cd7c82

use const

287cf12

Update comments

9930bbf

Update activeness table comment

570eae3

Simplify

0189945

Use SKS readiness condition for computing active condition

d021d2c

Fix after rebase

54da489

taragu force-pushed the tara-init-scale-0-initial-scale-0 branch from 6d826af to 54da489 Compare July 27, 2020 14:07

vagababov reviewed Jul 27, 2020

Remove SKS readiness check from marking ScaleTargetInitialized

4213378

vagababov reviewed Jul 27, 2020

knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020

julz reviewed Jul 28, 2020

markusthoemmes approved these changes Jul 28, 2020

knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2020

knative-prow-robot merged commit e9ae313 into knative:master Jul 28, 2020

taragu deleted the tara-init-scale-0-initial-scale-0 branch July 29, 2020 11:33

taragu mentioned this pull request Aug 6, 2020

Don't scale to 1 upon deploy #4098

Closed

julz mentioned this pull request Apr 15, 2021

Add julz to API Approvers knative/community#573

Merged
