Nil pointer dereference in KCM after v1 HPA patch request #107038

Open
natherz97 opened this issue Dec 15, 2021 · 18 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@natherz97

What happened?

If we create a v2beta2 HPA object with fields specified for the scaleUp or scaleDown behavior, these values can be modified through the v1 version of the same object by changing the autoscaling.alpha.kubernetes.io/behavior annotation. If we patch the scaleUp or scaleDown field in the annotation to null, the patch request is accepted without the default values for scaleUp or scaleDown being applied. This results in a panic in KCM (kube-controller-manager) when it tries to dereference a nil pointer:

E1207 02:43:43.082791      10 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2238 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41c0960, 0x747b350)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x41c0960, 0x747b350)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).stabilizeRecommendationWithBehaviors(0xc000a700b0, 0xc0004bd760, 0x1a, 0x0, 0xc001a4c570, 0x500000004, 0x100000004, 0x0, 0x7fe0a1a18988, 0x7fe0a1a26ef8, ...)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:885 +0x3c
...

This panic occurs here: https://github.com/kubernetes/kubernetes/blob/v1.21.2/pkg/controller/podautoscaler/horizontal.go#L885
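
For illustration only, here is a minimal, self-contained Go sketch of the dereference pattern that panics once the round-trip leaves scaleUp as nil. The types below merely mirror the relevant v2beta2 fields and are not the actual Kubernetes source:

package main

import "fmt"

// HPAScalingRules mirrors only the field relevant here (illustrative;
// the real type lives in k8s.io/api/autoscaling/v2beta2).
type HPAScalingRules struct {
	StabilizationWindowSeconds *int32
}

// normalizationArg stands in for the controller's internal argument struct.
type normalizationArg struct {
	ScaleUpBehavior   *HPAScalingRules
	ScaleDownBehavior *HPAScalingRules
}

func main() {
	window := int32(300)
	args := normalizationArg{
		// After the v1 annotation patch, scaleUp round-trips back as nil
		// instead of being re-defaulted.
		ScaleUpBehavior:   nil,
		ScaleDownBehavior: &HPAScalingRules{StabilizationWindowSeconds: &window},
	}

	// The controller assumes both behaviors are non-nil, so a line like this
	// reproduces the "invalid memory address or nil pointer dereference".
	fmt.Println(*args.ScaleUpBehavior.StabilizationWindowSeconds)
}

Running this produces the same runtime error as the KCM stack trace above, which is why a nil check (or re-applied defaulting) is needed before the dereference.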

What did you expect to happen?

The patch request to the v1 HPA object should have applied the default values for the scaleUp or scaleDown fields when the annotation edit set them to null. When the v2beta2 object is updated, omitting the scaleUp or scaleDown properties results in default values being applied, so the controller assumes that no fields in this struct are nil:

// We guarantee that no pointer in the structure will have the 'nil' value
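
As a rough sketch of what the expected defaulting would amount to (assumption: the default scale-up values used here are taken from the annotation shown in step 2 of the reproduction below, i.e. stabilization window 0, selectPolicy Max, Pods 4 per 15s and Percent 100 per 15s; the types and the defaultScaleUpRules helper are illustrative, not the real Kubernetes defaulting code):

package main

import "fmt"

// Illustrative mirrors of the v2beta2 behavior types; field names follow the
// behavior annotation shown in the reproduction steps, not the real API package.
type scalingPolicy struct {
	Type          string
	Value         int32
	PeriodSeconds int32
}

type scalingRules struct {
	StabilizationWindowSeconds *int32
	SelectPolicy               *string
	Policies                   []scalingPolicy
}

// defaultScaleUpRules is a hypothetical helper: if scaleUp came back as nil
// (e.g. after the annotation was patched to null), fill in the default
// scale-up behavior instead of leaving it nil.
func defaultScaleUpRules(rules *scalingRules) *scalingRules {
	if rules != nil {
		return rules
	}
	window := int32(0)
	policy := "Max"
	return &scalingRules{
		StabilizationWindowSeconds: &window,
		SelectPolicy:               &policy,
		Policies: []scalingPolicy{
			{Type: "Pods", Value: 4, PeriodSeconds: 15},
			{Type: "Percent", Value: 100, PeriodSeconds: 15},
		},
	}
}

func main() {
	// With defaulting (re)applied on write, the controller never sees a nil scaleUp.
	scaleUp := defaultScaleUpRules(nil)
	fmt.Printf("window=%d selectPolicy=%s policies=%d\n",
		*scaleUp.StabilizationWindowSeconds, *scaleUp.SelectPolicy, len(scaleUp.Policies))
}

If defaulting along these lines were applied when the v1 annotation sets ScaleUp to null, the dereference in the controller would be safe.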

How can we reproduce it (as minimally and precisely as possible)?

Using v1.21.2:

  1. Create an example HPA object using autoscaling/v2beta2:
cat <<EOF | kubectl apply -n kube-system -f -
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 4
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown: 
      policies: 
      - type: Pods 
        value: 4 
        periodSeconds: 60 
      - type: Percent
        value: 10 
        periodSeconds: 60
      selectPolicy: Min 
      stabilizationWindowSeconds: 300 
EOF
  2. Edit the v1 version of the HPA object to set scaleUp as null in the annotation:
kubectl edit hpa.v1.autoscaling coredns-scaler -n kube-system

Change this annotation:

autoscaling.alpha.kubernetes.io/behavior: '{"ScaleUp":{"StabilizationWindowSeconds":0,"SelectPolicy":"Max","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":15},{"Type":"Percent","Value":100,"PeriodSeconds":15}]},"ScaleDown":{"StabilizationWindowSeconds":300,"SelectPolicy":"Min","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":60},{"Type":"Percent","Value":10,"PeriodSeconds":60}]}}'

To this:

autoscaling.alpha.kubernetes.io/behavior: '{"ScaleUp":null,"ScaleDown":{"StabilizationWindowSeconds":300,"SelectPolicy":"Min","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":60},{"Type":"Percent","Value":10,"PeriodSeconds":60}]}}'
  3. Now when we fetch the v2beta2 object, the scaleUp spec is missing:
kubectl get hpa.v2beta2.autoscaling coredns-scaler -n kube-system -o yaml
...
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 10
      selectPolicy: Min
      stabilizationWindowSeconds: 300
  maxReplicas: 5
  metrics:
...
  4. KCM intermittently becomes unhealthy:
sh-4.2$ kubectl get cs

NAME                 STATUS      MESSAGE                                                                                       ERROR
controller-manager   Unhealthy   Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Healthy     ok
etcd-0               Healthy     {"health":"true"}

Checking KCM logs:

E1215 00:11:52.291834       9 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2339 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41c0960, 0x747b350)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x41c0960, 0x747b350)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).stabilizeRecommendationWithBehaviors(0xc000d911e0, 0xc0012a7440, 0x1a, 0x0, 0xc001605bf0, 0x500000004, 0x100000004, 0x0, 0x7f4d5e8af368, 0x7f4d5e8b1fc0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:885 +0x3c
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).normalizeDesiredReplicasWithBehaviors(0xc000d911e0, 0xc0000fc700, 0xc0012a7440, 0x1a, 0x100000004, 0x4, 0x1)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:781 +0x127
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileAutoscaler(0xc000d911e0, 0xc00096be00, 0xc0012a7440, 0x1a, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:679 +0x2070
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileKey(0xc000d911e0, 0xc0012a7440, 0x1a, 0x3eca300, 0xc000d8ae48, 0x17113ec)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:371 +0x1b7
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).processNextWorkItem(0xc000d911e0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:225 +0xd8
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).worker(0xc000d911e0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:213 +0x2f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001dd1100)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001dd1100, 0x512cde0, 0xc001dd90b0, 0x512df01, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001dd1100, 0x3b9aca00, 0x0, 0xc001dd2701, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001dd1100, 0x3b9aca00, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:177 +0x245
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17b41bc]

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:23:18Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@natherz97 natherz97 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 15, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2021
@shyamjvs
Member

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2021
@pacoxu
Member

pacoxu commented Dec 15, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2021
@Haleygo

Haleygo commented Dec 15, 2021

I would like to work on this.
/assign

@NikhilSharmaWe
Member

NikhilSharmaWe commented Dec 18, 2021

@natherz97 @pacoxu

  • Since func GenerateHPAScaleUpRules sets default values for nil fields, why is this error showing up?
  • Does this comment tell us something about this behavior?

Guidance needed. Could you elaborate on what is causing this bug?

@NikhilSharmaWe
Member

/assign

@liggitt
Member

liggitt commented Jan 4, 2022

cc @josephburnett

@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 4, 2022
@k8s-triage-robot

The issue has been marked as an important bug and triaged.
Such issues are automatically marked as frozen when hitting the rotten state
to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 9, 2022
@NikhilSharmaWe
Member

/remove-lifecycle rotten

@Haleygo Haleygo removed their assignment May 10, 2022
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 7, 2023
@pacoxu
Member

pacoxu commented Mar 28, 2023

/help
The linked PR should be updated according to #107126 (comment).

@k8s-ci-robot
Contributor

@pacoxu:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
The linked PR should be updated according to #107126 (comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 28, 2023
@pacoxu
Member

pacoxu commented Mar 28, 2023

/remove-lifecycle rotten
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 28, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 26, 2023
@jkyros

jkyros commented Jan 26, 2024

@pacoxu, I know this is old, but:

Fixing this now and getting it back to pre-1.27 is more than 3 versions (and 1.26 is EOL next month), so I assume it would be pointless to try and fix the v1 API now by finishing #107126? -- since we couldn't backport it to a version that mattered anyway?

(As for why I'm interested, I have a downstream customer running < 1.27 that wanted it fixed and I was trying to see what I could do to accommodate that)

@pacoxu
Member

pacoxu commented Jan 26, 2024

@pacoxu, I know this is old, but:

Do you mean this is fixed and this issue can be closed for 1.27+?

Fixing this now and getting it back to pre-1.27 is more than 3 versions (and 1.26 is EOL next month), so I assume it would be pointless to try and fix the v1 API now by finishing #107126?
-- since we couldn't backport it to a version that mattered anyway?

Only v1.26 can be fixed in the community.

(As for why I'm interested, I have a downstream customer running < 1.27 that wanted it fixed and I was trying to see what I could do to accommodate that)

I'm afraid you have to maintain a patched version or use some webhook (or Kyverno, like https://kyverno.io/policies/other/add-default-resources/add-default-resources/) to prevent this from happening.

  • If you add a webhook that injects default values for scaleUp and scaleDown, the panic will not happen. I haven't tested it yet.

@jkyros

jkyros commented Jan 26, 2024

Do you mean this is fixed and this issue can be closed for 1.27+?

Correct, this is fixed and the issue can be closed for 1.27+. I happened across that accidentally while testing a fix -- it was kind of a "wait, why can't I break this anymore" moment, and then I traced it back to when it changed -- but I did test it.

Only v1.26 can be fixed in community.

Thanks for confirming, I appreciate it.

I'm afraid you have to maintain a patched version or use some webhook (or Kyverno, like https://kyverno.io/policies/other/add-default-resources/add-default-resources/) to prevent this from happening.

Thanks for your time and the suggestions -- I'm confident that would work (I do a lot of work downstream on OpenShift, and we do have our own fork; I just try to avoid having too many "carry patches" if I can help it). Thanks!
