Service initial scale implementation for initialScale 0 use case#8613
Conversation
knative-prow-robot
left a comment
There was a problem hiding this comment.
@taragu: 0 warnings.
Details
In response to this:
/lint
Part 8 of #7682
Proposed Changes
Implementation of initial scale == 0 use case.
Release Note
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
| // Manually set the desired scale to 0 if initial scale is 0, because otherwise | ||
| // in computeActiveCondition, upon initial deploy we will miss the pc.want == 0 | ||
| // case and PA will go from inactive to activating. | ||
| // Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once |
There was a problem hiding this comment.
This seems like it means this happens on every scale-from-zero, not just the first? The only thing I can see here that stops this happening on every scale-from-zero is want = -1, but I think that could also be the case on an autoscaler restart (where we also have no metrics, but where I think we wouldnt want the InitialScale behaviour?)
There was a problem hiding this comment.
i tested this locally and we don't get want = -1 once we've scaled up from zero (after we scale down, PA's desired scale is 0), so this will only happen for the first scale-from-zero
There was a problem hiding this comment.
what about after autoscaler is restarted?
There was a problem hiding this comment.
Yes, you need to restart the autoscaler to get -1 again
| // Manually set the desired scale to 0 if initial scale is 0, because otherwise | ||
| // in computeActiveCondition, upon initial deploy we will miss the pc.want == 0 | ||
| // case and PA will go from inactive to activating. | ||
| // Can't use pa.Status.IsScaleTargetInitialized() here because it will only be false once |
There was a problem hiding this comment.
Yes, you need to restart the autoscaler to get -1 again
| if err != nil { | ||
| return fmt.Errorf("error checking pods for revision %s: %w", pa.Labels[serving.RevisionLabelKey], err) | ||
| } | ||
| // Manually set the desired scale to 0 if initial scale is 0, because otherwise |
There was a problem hiding this comment.
Should this maybe live inside scaler.go? I can see the SKS check is kind of dependent on this, but it feels more natural there.
There was a problem hiding this comment.
yes we can move this to scaler.go
| if pc.want > 0 || !pa.Status.IsInactive() { | ||
| pa.Status.MarkScaleTargetInitialized() | ||
| // SKS should already be active. | ||
| // In the initialScale > 0 case, SKS should already be active. However, when initialScale |
There was a problem hiding this comment.
Same here: if pc.want == 0 which is what happens when we set IS=0.
Not sure that SKS is important here in the comment.
There was a problem hiding this comment.
I was trying to say even after SKS has come up, we can still get pc.want = -1 here because we only override -1 with 0 once, before we MarkScaleTargetInitialized, but pc.want can remain -1 for a few seconds
| } else { | ||
| // Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler | ||
| // does not set any condition during reconciliation of a new generation | ||
| pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.") |
There was a problem hiding this comment.
How do we end up here? Since your change should only pass through pc.want == 0 case?
There was a problem hiding this comment.
We do hit this after minReady switches from 0 to 1. Even though PA has been marked inactive before in the pc.want==0 path, we'll get the NewObservedGenFailure if this is not set again.
There was a problem hiding this comment.
hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?
There was a problem hiding this comment.
Yes exactly, pc.want is 0 for only one iteration. As soon as we mark ScaleTargetInitialized, it becomes -1 again.
There was a problem hiding this comment.
Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.
6519ebd to
272c331
Compare
|
@vagababov would you please take another look? |
| } else { | ||
| // Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler | ||
| // does not set any condition during reconciliation of a new generation | ||
| pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.") |
There was a problem hiding this comment.
hm... so we set minReady to 0, then mark initial scale achieved to true, which means minReady becomes 1 on the next iteration, right?
But even on the next iteration pc.want should be 0 and go into the first case statement? Or it is -1 now and that's why we end up here?
|
/retest |
1 similar comment
|
/retest |
vagababov
left a comment
There was a problem hiding this comment.
I have comment only issues now. I think every change in this PR is breaking some assumptions so it needs detailed and clear comments throughout.
| } else { | ||
| // Need to set a condition because otherwise we will get NewObservedGenFailure because the reconciler | ||
| // does not set any condition during reconciliation of a new generation | ||
| pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.") |
There was a problem hiding this comment.
Can you expand the comment more, that this is a specific issue for initialScale=0 case and since there's no metric collection (no pods, no metrics), we end up with wantscale=-1, etc.
| pa.Status.MarkActive() | ||
| // If initial scale is 0, we don't want to mark scale target initialized when | ||
| // want is still -1, because we will override -1 with 0 after SKS has been setup. | ||
| if minReady > 0 || (minReady == 0 && pc.want != -1) { |
There was a problem hiding this comment.
I think you want to mention that pc.want is going to be -1 after first iteration when initialScale=0. Otherwise it's not clear how we end up in this situation.
7330eab to
127961d
Compare
|
/assign @markusthoemmes |
markusthoemmes
left a comment
There was a problem hiding this comment.
This seems to work as expected but I couldn't help but feel that the overriding logic is getting super complex (i.e. us overriding certain parts in certain situations).
I've played around a little and came up with this potential simplification: 1bb4361. I have no clue if this isn't just hot garbage! You've got a much better view at the different states than I do currently. Does this make sense at all?
|
@markusthoemmes I tried out the diff but I'm getting the NewObservedGenFailure status when I tested this locally with deploying a ksvc with 0 initial scale. Line 261-265 ( serving/pkg/reconciler/autoscaling/kpa/kpa.go Line 265 in 6c1fa73 |
|
@taragu |
ad01d7a to
6d826af
Compare
|
The following jobs failed:
Failed non-flaky tests preventing automatic retry of pull-knative-serving-unit-tests: |
6d826af to
54da489
Compare
|
I've updated the PR with the new SKS ready status. When initial scale is 0, SKS won't be ready because the private service endpoints isn't populated. So for the condition of marking ScaleTargetInitialized, I'm using serving/pkg/reconciler/autoscaling/kpa/kpa.go Lines 240 to 244 in 54da489 @julz brought up the idea that we could discard any conditions for marking ScaleTargetInitialized, because we already have a separate status to indicate when SKS isn't ready and its reason, so it would be okay if we mark ScaleTargetInitialized prior to SKS becoming ready. @vagababov @markusthoemmes WDYT? |
|
The point is I believe that we want the SKS to be ready because that's a prerequisite of the service being able to scale up and receive requests 🤔 |
| // | -1 | >= min | >0 | active | active | | ||
| func computeActiveCondition(ctx context.Context, pa *pav1alpha1.PodAutoscaler, pc podCounts) { | ||
| minReady := activeThreshold(ctx, pa) | ||
| if pc.ready >= minReady && (pa.Status.IsSKSReady() || |
There was a problem hiding this comment.
Do we need to check IsSKSReady at all?
Since this no longer gates the readiness of the PA overall, we can set it (if we have enough ready pods — we've achieved target scale).
There was a problem hiding this comment.
Any chance we have a Reason to compare to instead of a Message? The latter makes me nervous as Messages may change over time.
There was a problem hiding this comment.
this should be removed now altogether, I think.
There was a problem hiding this comment.
yep, we got rid of this.
| switch { | ||
| case pc.want == 0: | ||
| if pa.Status.IsActivating() { | ||
| case pc.want == 0 || minReady == 0: |
There was a problem hiding this comment.
do we need to do || minReady==0? It seems that pc.want=0 is a superset of that? E.g. if minReady=0, pc.want will be 0 (the opposite is not true, though).
There was a problem hiding this comment.
We aren't overriding -1 with 0, so we are still hitting the pc.want = -1, minReady = 0 case.
There was a problem hiding this comment.
hm, I had that state somewhere in my head, but I guess we got rid of that :-)
There was a problem hiding this comment.
yeah originally we were overriding -1 with 0, but with Markus' simplification we are able to get rid of it.
| } else { | ||
| // This is for the initialScale 0 case. In the first iteration, minReady is 0, | ||
| // but for the following iterations, minReady is 1. pc.want will continue being | ||
| // -1 until we start receiving metrics, so we will end up here. | ||
| // Even though PA is already been marked as inactive in the first iteration, we | ||
| // still need to set it again. Otherwise reconciliation will fail with NewObservedGenFailure | ||
| // because we cannot go through one iteration of reconciliation without setting | ||
| // some status. | ||
| pa.Status.MarkInactive("NoTraffic", "The target is not receiving traffic.") |
There was a problem hiding this comment.
The way I read it, this is redundant.
On first iteration we'll enter the first switch case and mark PA as Inactive.
On the next ones we'll enter here (0 < 1). But as you mentioned pc.want=-1 and the pa.Status==Inactive: thus the if above will always evaluate to false (unless we receive requests and positive metrics and pc.want becomes positive). Thus you're just marking Inactive with Inactive again.
There was a problem hiding this comment.
Correct, we are marking Inactive with Inactive again here. This is still needed because we will be getting the NewObservedGenFailure during reconciliation post processing, because we cannot go through one iteration of reconciliation without setting some sort of status.
There was a problem hiding this comment.
Hm, so just updating the status with the same value updates ObsGen?
@whaught, Weston, if change in inputs didn't yield any change in status, why would this be an error?
There was a problem hiding this comment.
@whaught i'm not sure if my understanding is correct. I think it's because PA has been updated during the reconciliation, therefore there's a difference in ObsGen and Gen, but the status is not updated, which causes the failure: https://github.com/knative/pkg/blob/deb6b33d2a6c114f596f52630e85c475bf43abce/reconciler/reconcile_common.go#L50-L53
There was a problem hiding this comment.
When Spec changes, we reset Ready to unknown with a dummy message that the reconciler didn't set anything for a new spec before calling ReconcileKind. The reconciler is expected to set something upon observation of a new generation (or is left with the default message as a warning)
There was a problem hiding this comment.
I wonder the spec change in PA is 🤔
But anyway, it's interesting, since this does not change ready (before and after would be unknown).
|
The following is the coverage report on the affected files.
|
vagababov
left a comment
There was a problem hiding this comment.
Alright, beyond the strange requirement to call an idempotent status — which is probably outside of scope of this PR anyway — this looks good.
For some reason I thought we were setting want=0, if initialscale=0 && targetScaleNotReached (probably would simplify things, bu we can iterate), but I guess it was an intermediate step.
/lgtm
I'll let Markus approve, in case I am missing something.
Thanks Tara for the patience:)
| } | ||
|
|
||
| // IsSKSReady returns true if the PA condition denoting that SKS is ready. | ||
| func (pas *PodAutoscalerStatus) IsSKSReady() bool { |
There was a problem hiding this comment.
I think this isn't actually needed after latest refactor, but I guess it's a reasonable method to have anyway 🤷🏼
markusthoemmes
left a comment
There was a problem hiding this comment.
/lgtm
/approve
Thanks for the patience! Let's land this! 🎉
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: markusthoemmes, taragu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
🎉🎉🎉🎉 |
/lint
Part 8 of #7682
Proposed Changes
Implementation of initial scale == 0 use case.
Release Note
/cc @julz @vagababov @markusthoemmes