
A bad jsonpatch can knock out the CAPI controller manager #8059

Closed
DanielXiao opened this issue Feb 6, 2023 · 5 comments · Fixed by #8067
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@DanielXiao

What steps did you take and what happened:
Add a list variable to the ClusterClass

  - name: controlPlanePreKubeadmCommands
    required: false
    schema:
      openAPIV3Schema:
        type: array
        items:
          type: string
        default: []

Add the following inline patch

  - name: controlPlanePreKubeadmCommands
    definitions:
    - selector:
        apiVersion: controlplane.cluster.x-k8s.io/v1beta1
        kind: KubeadmControlPlaneTemplate
        matchResources:
          controlPlane: true
      jsonPatches:
      - op: replace
        path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands
        valueFrom:
          template: |
            {{- range .controlPlanePreKubeadmCommands }}
            - {{ . }}
            {{- end }}

Omit controlPlanePreKubeadmCommands from your Cluster variables and create the Cluster. You will see that the CAPI controller manager Pod goes down.
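For illustration, a minimal Cluster manifest for this reproduction could look like the sketch below (the ClusterClass name and Kubernetes version are placeholders; the point is that controlPlanePreKubeadmCommands is not listed under spec.topology.variables):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-1
  namespace: default
spec:
  topology:
    class: my-cluster-class   # placeholder; use the ClusterClass that defines the variable and patch above
    version: v1.24.0          # placeholder Kubernetes version
    controlPlane:
      replicas: 1
    # no entry for controlPlanePreKubeadmCommands under variables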

❯ kubectl get po -n capi-system
NAME                                      READY   STATUS             RESTARTS        AGE
capi-controller-manager-8778fdfcb-g57fk   0/1     CrashLoopBackOff   7 (2m21s ago)   16m

It crashes with the following error. You can delete the CAPI controller manager Pod so that it gets recreated, but the new Pod always hits the same error.

I0206 02:15:26.100037       1 controller.go:117] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="topology/cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" cluster="default/workload-1" namespace="default" name="workload-1" reconcileID=e1cf4673-ef91-412e-94e5-42467d62e045
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1659c95]

goroutine 1328 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x1c367c0, 0x322b990})
        runtime/panic.go:838 +0x207
github.com/evanphx/json-patch/v5.findObject(0x778d6d?, {0xc000a33a00?, 0x1aeab00?}, 0xc0011901a0?)
        github.com/evanphx/json-patch/[email protected]/patch.go:513 +0x155
github.com/evanphx/json-patch/v5.Patch.add({0xc0003e6800?, 0x16b6?, 0x16b6?}, 0xc001aff768?, 0xc0019d4880?, 0xc0015c8be8)
        github.com/evanphx/json-patch/[email protected]/patch.go:714 +0xae
github.com/evanphx/json-patch/v5.Patch.ApplyIndentWithOptions({0xc000747d10?, 0x6, 0x6}, {0xc0003e6800, 0x16b6, 0x16b6}, {0x0, 0x0}, 0xc00094fec0?)
        github.com/evanphx/json-patch/[email protected]/patch.go:1095 +0x1f9
github.com/evanphx/json-patch/v5.Patch.ApplyWithOptions(...)
        github.com/evanphx/json-patch/[email protected]/patch.go:1059
github.com/evanphx/json-patch/v5.Patch.Apply({0xc000747d10, 0x6, 0x6}, {0xc0003e6800, 0x16b6, 0x16b6})
        github.com/evanphx/json-patch/[email protected]/patch.go:1053 +0x8e
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches.applyPatchesToRequest({0x21fd060?, 0xc00107dc80?}, 0xc0010a6500, 0xc00152c180)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches/engine.go:310 +0x32e
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches.(*engine).Apply(0xc0003babc0, {0x21fd060, 0xc0016c4420}, 0xc0018a8120, 0xc0011665c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches/engine.go:101 +0x39b
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).computeDesiredState(0xc00011cb00, {0x21fd060, 0xc0016c4420}, 0xc0018a80c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/desired_state.go:103 +0x431
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcile(0xc000a8bdc0?, {0x21fd060, 0xc0016c4420}, 0xc0018a80c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:235 +0x14a
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).Reconcile(0xc00011cb00, {0x21fd060, 0xc0016c4420}, {{{0xc000c042e0?, 0x10?}, {0xc000c042d6?, 0x40d987?}}})
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:198 +0x445
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x21fcfb8?, {0x21fd060?, 0xc0016c4420?}, {{{0xc000c042e0?, 0x1de9760?}, {0xc000c042d6?, 0x4041f4?}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005d85a0, {0x21fcfb8, 0xc000991b00}, {0x1cc51e0?, 0xc0001c1820?})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005d85a0, {0x21fcfb8, 0xc000991b00})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x325

You cannot delete the problematic Cluster object either, since the validating webhook is served by the same crashed Pod.

❯ kubectl delete cluster workload-1
Error from server (InternalError): Internal error occurred: failed calling webhook "validation.cluster.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.capi-system.svc:443/validate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": dial tcp 100.68.35.5:443: connect: connection refused

There is no way to recover the CAPI controller manager unless you delete CAPI's ValidatingWebhookConfiguration first.
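For example, assuming the default installation where the configuration is named capi-validating-webhook-configuration (check with kubectl get validatingwebhookconfigurations), the recovery could look roughly like this:

❯ kubectl delete validatingwebhookconfiguration capi-validating-webhook-configuration
❯ kubectl delete cluster workload-1

Once the problematic Cluster is gone the controller manager stops panicking; the deleted webhook configuration then needs to be restored, e.g. by re-applying the provider manifests.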

What did you expect to happen:
The CAPI controller manager Pod should tolerate errors from computing the topology, or guidance for recovery should be provided.

Anything else you would like to add:

Environment:

  • Cluster-api version: v1.2.8
  • minikube/kind version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Feb 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

Thx for the report.

I'll take a look.

@sbueringer
Member

sbueringer commented Feb 6, 2023

@DanielXiao I was unable to reproduce the issue. I added the variable and patch but it didn't panic.

Note: I also had to add the following to the KubeadmControlPlaneTemplate:

spec:
  template:
    spec:
      kubeadmConfigSpec:
        preKubeadmCommands: ["abc"]

Before that I just got the error that it's unable to find the key that it's supposed to replace (preKubeadmCommands).

What does your KubeadmControlPlaneTemplate look like at the moment the patch is applied? (To simplify this, you could move the patch to the beginning of the patches array.)

@sbueringer
Member

Okay, the panic occurs because, after one patch sets the array to nil, another patch tries to insert an element at the beginning of the (nil) array.
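For illustration, the failing combination boils down to an effective RFC 6902 patch sequence roughly like the sketch below (written as YAML; the first operation comes from the inline patch above, while the second stands in for whichever later patch inserts at the start of the same array, with a made-up placeholder value):

- op: replace
  path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands
  value: null               # the Go template renders to nothing when the variable is empty or omitted
- op: add
  path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands/0
  value: some-command       # placeholder; inserting into the now-nil array triggers the panic in json-patch v5.6.0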

I've opened an issue in the json patch repo to fix this specific panic: evanphx/json-patch#171

@sbueringer
Member

Opened a PR to catch panics during applying patches: #8067
