Nil pointer dereference in KCM after v1 HPA patch request #107038

Open
natherz97 opened this issue Dec 15, 2021 · 18 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/autoscaling: Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@natherz97

What happened?

If we create a v2beta2 HPA object with fields specified for the scaleUp or scaleDown behavior, these values can be modified through the v1 version of the same object by changing the autoscaling.alpha.kubernetes.io/behavior annotation. If we patch the scaleUp or scaleDown field in the annotation to null, the patch request is accepted without the default values for scaleUp or scaleDown being applied. This results in a panic in KCM (kube-controller-manager) when it tries to dereference a nil pointer:

E1207 02:43:43.082791      10 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2238 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41c0960, 0x747b350)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x41c0960, 0x747b350)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).stabilizeRecommendationWithBehaviors(0xc000a700b0, 0xc0004bd760, 0x1a, 0x0, 0xc001a4c570, 0x500000004, 0x100000004, 0x0, 0x7fe0a1a18988, 0x7fe0a1a26ef8, ...)
    /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:885 +0x3c
...

This panic occurs here: https://github.com/kubernetes/kubernetes/blob/v1.21.2/pkg/controller/podautoscaler/horizontal.go#L885
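
For illustration only, here is a minimal, self-contained Go sketch of the dereference pattern that panics once the round-trip leaves scaleUp as nil. The types below merely mirror the relevant v2beta2 fields and are not the actual Kubernetes source:

package main

import "fmt"

// HPAScalingRules mirrors only the field relevant here (illustrative;
// the real type lives in k8s.io/api/autoscaling/v2beta2).
type HPAScalingRules struct {
	StabilizationWindowSeconds *int32
}

// normalizationArg stands in for the controller's internal argument struct.
type normalizationArg struct {
	ScaleUpBehavior   *HPAScalingRules
	ScaleDownBehavior *HPAScalingRules
}

func main() {
	window := int32(300)
	args := normalizationArg{
		// After the v1 annotation patch, scaleUp round-trips back as nil
		// instead of being re-defaulted.
		ScaleUpBehavior:   nil,
		ScaleDownBehavior: &HPAScalingRules{StabilizationWindowSeconds: &window},
	}

	// The controller assumes both behaviors are non-nil, so a line like this
	// reproduces the "invalid memory address or nil pointer dereference".
	fmt.Println(*args.ScaleUpBehavior.StabilizationWindowSeconds)
}

Running this produces the same runtime error as the KCM stack trace above, which is why a nil check (or re-applied defaulting) is needed before the dereference.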

What did you expect to happen?

The patch request to the v1 HPA object should have applied the default values for the scaleUp or scaleDown fields when the annotation edit set them to null. When the v2beta2 object is updated, omitting the scaleUp or scaleDown properties results in default values being applied, so the controller assumes that no fields in this struct are nil:

// We guarantee that no pointer in the structure will have the 'nil' value
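
As a rough sketch of what the expected defaulting would amount to (assumption: the default scale-up values used here are taken from the annotation shown in step 2 of the reproduction below, i.e. stabilization window 0, selectPolicy Max, Pods 4 per 15s and Percent 100 per 15s; the types and the defaultScaleUpRules helper are illustrative, not the real Kubernetes defaulting code):

package main

import "fmt"

// Illustrative mirrors of the v2beta2 behavior types; field names follow the
// behavior annotation shown in the reproduction steps, not the real API package.
type scalingPolicy struct {
	Type          string
	Value         int32
	PeriodSeconds int32
}

type scalingRules struct {
	StabilizationWindowSeconds *int32
	SelectPolicy               *string
	Policies                   []scalingPolicy
}

// defaultScaleUpRules is a hypothetical helper: if scaleUp came back as nil
// (e.g. after the annotation was patched to null), fill in the default
// scale-up behavior instead of leaving it nil.
func defaultScaleUpRules(rules *scalingRules) *scalingRules {
	if rules != nil {
		return rules
	}
	window := int32(0)
	policy := "Max"
	return &scalingRules{
		StabilizationWindowSeconds: &window,
		SelectPolicy:               &policy,
		Policies: []scalingPolicy{
			{Type: "Pods", Value: 4, PeriodSeconds: 15},
			{Type: "Percent", Value: 100, PeriodSeconds: 15},
		},
	}
}

func main() {
	// With defaulting (re)applied on write, the controller never sees a nil scaleUp.
	scaleUp := defaultScaleUpRules(nil)
	fmt.Printf("window=%d selectPolicy=%s policies=%d\n",
		*scaleUp.StabilizationWindowSeconds, *scaleUp.SelectPolicy, len(scaleUp.Policies))
}

If defaulting along these lines were applied when the v1 annotation sets ScaleUp to null, the dereference in the controller would be safe.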

How can we reproduce it (as minimally and precisely as possible)?

Using v1.21.2:

  1. Create an example HPA object using autoscaling/v2beta2:
cat <<EOF | kubectl apply -n kube-system -f -
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: coredns-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  minReplicas: 4
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleDown: 
      policies: 
      - type: Pods 
        value: 4 
        periodSeconds: 60 
      - type: Percent
        value: 10 
        periodSeconds: 60
      selectPolicy: Min 
      stabilizationWindowSeconds: 300 
EOF
  2. Edit the v1 version of the HPA object to set scaleUp as null in the annotation:
kubectl edit hpa.v1.autoscaling coredns-scaler -n kube-system

Change this annotation:

autoscaling.alpha.kubernetes.io/behavior: '{"ScaleUp":{"StabilizationWindowSeconds":0,"SelectPolicy":"Max","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":15},{"Type":"Percent","Value":100,"PeriodSeconds":15}]},"ScaleDown":{"StabilizationWindowSeconds":300,"SelectPolicy":"Min","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":60},{"Type":"Percent","Value":10,"PeriodSeconds":60}]}}'

To this:

autoscaling.alpha.kubernetes.io/behavior: '{"ScaleUp":null,"ScaleDown":{"StabilizationWindowSeconds":300,"SelectPolicy":"Min","Policies":[{"Type":"Pods","Value":4,"PeriodSeconds":60},{"Type":"Percent","Value":10,"PeriodSeconds":60}]}}'
  3. Now when we fetch the v2beta2 object, the scaleUp spec is missing:
kubectl get hpa.v2beta2.autoscaling coredns-scaler -n kube-system -o yaml
...
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 60
        type: Pods
        value: 4
      - periodSeconds: 60
        type: Percent
        value: 10
      selectPolicy: Min
      stabilizationWindowSeconds: 300
  maxReplicas: 5
  metrics:
...
  4. KCM intermittently becomes unhealthy:
sh-4.2$ kubectl get cs

NAME                 STATUS      MESSAGE                                                                                       ERROR
controller-manager   Unhealthy   Get "http://127.0.0.1:10252/healthz": dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Healthy     ok
etcd-0               Healthy     {"health":"true"}

Checking KCM logs:

E1215 00:11:52.291834       9 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 2339 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.logPanic(0x41c0960, 0x747b350)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x86
panic(0x41c0960, 0x747b350)
        /usr/local/go/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).stabilizeRecommendationWithBehaviors(0xc000d911e0, 0xc0012a7440, 0x1a, 0x0, 0xc001605bf0, 0x500000004, 0x100000004, 0x0, 0x7f4d5e8af368, 0x7f4d5e8b1fc0, ...)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:885 +0x3c
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).normalizeDesiredReplicasWithBehaviors(0xc000d911e0, 0xc0000fc700, 0xc0012a7440, 0x1a, 0x100000004, 0x4, 0x1)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:781 +0x127
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileAutoscaler(0xc000d911e0, 0xc00096be00, 0xc0012a7440, 0x1a, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:679 +0x2070
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).reconcileKey(0xc000d911e0, 0xc0012a7440, 0x1a, 0x3eca300, 0xc000d8ae48, 0x17113ec)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:371 +0x1b7
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).processNextWorkItem(0xc000d911e0, 0x0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:225 +0xd8
k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).worker(0xc000d911e0)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:213 +0x2f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001dd1100)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001dd1100, 0x512cde0, 0xc001dd90b0, 0x512df01, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001dd1100, 0x3b9aca00, 0x0, 0xc001dd2701, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001dd1100, 0x3b9aca00, 0xc000e22d80)
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).Run
        /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:177 +0x245
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x17b41bc]

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:23:18Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

AWS

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@natherz97 natherz97 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 15, 2021
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2021
@shyamjvs
Member

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 15, 2021
@pacoxu
Member

pacoxu commented Dec 15, 2021

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2021
@Haleygo

Haleygo commented Dec 15, 2021

I would like to work on this.
/assign

@NikhilSharmaWe
Member

NikhilSharmaWe commented Dec 18, 2021

@natherz97 @pacoxu

  • Since func GenerateHPAScaleUpRules sets default values for nil fields, why is this error showing up?
  • Does this comment tell us something about this behavior?

Guidance needed. Could you elaborate on what is causing this bug?

@NikhilSharmaWe
Member

/assign

@liggitt
Member

liggitt commented Jan 4, 2022

cc @josephburnett

@liggitt liggitt added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 4, 2022
@k8s-triage-robot

The issue has been marked as an important bug and triaged.
Such issues are automatically marked as frozen when hitting the rotten state
to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels May 9, 2022
@NikhilSharmaWe
Member

/remove-lifecycle rotten

@Haleygo Haleygo removed their assignment May 10, 2022
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 7, 2023
@pacoxu
Member

pacoxu commented Mar 28, 2023

/help
The linked PR should be updated according to #107126 (comment).

@k8s-ci-robot
Contributor

@pacoxu:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help
The linked PR should be updated according to #107126 (comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 28, 2023
@pacoxu
Member

pacoxu commented Mar 28, 2023

/remove-lifecycle rotten
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 28, 2023
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jun 26, 2023
@jkyros

jkyros commented Jan 26, 2024

@pacoxu, I know this is old, but:

Fixing this now and getting it back to pre-1.27 is more than 3 versions (and 1.26 is EOL next month), so I assume it would be pointless to try and fix the v1 API now by finishing #107126? -- since we couldn't backport it to a version that mattered anyway?

(As for why I'm interested, I have a downstream customer running < 1.27 that wanted it fixed and I was trying to see what I could do to accommodate that)

@pacoxu
Member

pacoxu commented Jan 26, 2024

@pacoxu, I know this is old, but:

Do you mean this is fixed and this issue can be closed for 1.27+?

Fixing this now and getting it back to pre-1.27 is more than 3 versions (and 1.26 is EOL next month), so I assume it would be pointless to try and fix the v1 API now by finishing #107126?
-- since we couldn't backport it to a version that mattered anyway?

Only v1.26 can be fixed in the community.

(As for why I'm interested, I have a downstream customer running < 1.27 that wanted it fixed and I was trying to see what I could do to accommodate that)

I'm afraid you have to maintain a patched version or use some webhook (or Kyverno, like https://kyverno.io/policies/other/add-default-resources/add-default-resources/) to prevent this from happening.

  • If you add a webhook that injects default values for scaleUp and scaleDown, the panic will not happen. I haven't tested it yet.

@jkyros

jkyros commented Jan 26, 2024

Do you mean this is fixed and this issue can be closed for 1.27+?

Correct, this is fixed and the issue can be closed for 1.27+. I happened across that accidentally while testing a fix -- it was kind of a "wait, why can't I break this anymore" moment, and then I traced it back to when it changed -- but I did test it.

Only v1.26 can be fixed in community.

Thanks for confirming, I appreciate it.

I'm afraid you have to maintain a patched version or use some webhook (or Kyverno, like https://kyverno.io/policies/other/add-default-resources/add-default-resources/) to prevent this from happening.

Thanks for your time and the suggestions -- I'm confident that would work (I do a lot of work downstream on OpenShift, and we do have our own fork; I just try to avoid having too many "carry patches" if I can help it). Thanks!
