
A bad jsonpatch can knock out the CAPI controller manager #8059

Closed
DanielXiao opened this issue Feb 6, 2023 · 5 comments · Fixed by #8067
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@DanielXiao

What steps did you take and what happened:
Add a list variable to the ClusterClass

  - name: controlPlanePreKubeadmCommands
    required: false
    schema:
      openAPIV3Schema:
        type: array
        items:
          type: string
        default: []

Add the following inline patch

  - name: controlPlanePreKubeadmCommands
    definitions:
    - selector:
        apiVersion: controlplane.cluster.x-k8s.io/v1beta1
        kind: KubeadmControlPlaneTemplate
        matchResources:
          controlPlane: true
      jsonPatches:
      - op: replace
        path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands
        valueFrom:
          template: |
            {{- range .controlPlanePreKubeadmCommands }}
            - {{ . }}
            {{- end }}

Omit controlPlanePreKubeadmCommands from your Cluster variables and create the Cluster. You will see that the CAPI controller manager Pod goes down.
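For illustration, a minimal Cluster manifest for this reproduction could look like the sketch below (the ClusterClass name and Kubernetes version are placeholders; the point is that controlPlanePreKubeadmCommands is not listed under spec.topology.variables):

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-1
  namespace: default
spec:
  topology:
    class: my-cluster-class   # placeholder; use the ClusterClass that defines the variable and patch above
    version: v1.24.0          # placeholder Kubernetes version
    controlPlane:
      replicas: 1
    # no entry for controlPlanePreKubeadmCommands under variables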

❯ kubectl get po -n capi-system
NAME                                      READY   STATUS             RESTARTS        AGE
capi-controller-manager-8778fdfcb-g57fk   0/1     CrashLoopBackOff   7 (2m21s ago)   16m

It crashes with the following error. You can delete the CAPI controller manager Pod so that it gets recreated, but the new Pod always hits the same error.

I0206 02:15:26.100037       1 controller.go:117] "Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" controller="topology/cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" cluster="default/workload-1" namespace="default" name="workload-1" reconcileID=e1cf4673-ef91-412e-94e5-42467d62e045
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1659c95]

goroutine 1328 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0x1f4
panic({0x1c367c0, 0x322b990})
        runtime/panic.go:838 +0x207
github.com/evanphx/json-patch/v5.findObject(0x778d6d?, {0xc000a33a00?, 0x1aeab00?}, 0xc0011901a0?)
        github.com/evanphx/json-patch/[email protected]/patch.go:513 +0x155
github.com/evanphx/json-patch/v5.Patch.add({0xc0003e6800?, 0x16b6?, 0x16b6?}, 0xc001aff768?, 0xc0019d4880?, 0xc0015c8be8)
        github.com/evanphx/json-patch/[email protected]/patch.go:714 +0xae
github.com/evanphx/json-patch/v5.Patch.ApplyIndentWithOptions({0xc000747d10?, 0x6, 0x6}, {0xc0003e6800, 0x16b6, 0x16b6}, {0x0, 0x0}, 0xc00094fec0?)
        github.com/evanphx/json-patch/[email protected]/patch.go:1095 +0x1f9
github.com/evanphx/json-patch/v5.Patch.ApplyWithOptions(...)
        github.com/evanphx/json-patch/[email protected]/patch.go:1059
github.com/evanphx/json-patch/v5.Patch.Apply({0xc000747d10, 0x6, 0x6}, {0xc0003e6800, 0x16b6, 0x16b6})
        github.com/evanphx/json-patch/[email protected]/patch.go:1053 +0x8e
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches.applyPatchesToRequest({0x21fd060?, 0xc00107dc80?}, 0xc0010a6500, 0xc00152c180)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches/engine.go:310 +0x32e
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches.(*engine).Apply(0xc0003babc0, {0x21fd060, 0xc0016c4420}, 0xc0018a8120, 0xc0011665c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/patches/engine.go:101 +0x39b
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).computeDesiredState(0xc00011cb00, {0x21fd060, 0xc0016c4420}, 0xc0018a80c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/desired_state.go:103 +0x431
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).reconcile(0xc000a8bdc0?, {0x21fd060, 0xc0016c4420}, 0xc0018a80c0)
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:235 +0x14a
sigs.k8s.io/cluster-api/internal/controllers/topology/cluster.(*Reconciler).Reconcile(0xc00011cb00, {0x21fd060, 0xc0016c4420}, {{{0xc000c042e0?, 0x10?}, {0xc000c042d6?, 0x40d987?}}})
        sigs.k8s.io/cluster-api/internal/controllers/topology/cluster/cluster_controller.go:198 +0x445
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x21fcfb8?, {0x21fd060?, 0xc0016c4420?}, {{{0xc000c042e0?, 0x1de9760?}, {0xc000c042d6?, 0x4041f4?}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:121 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005d85a0, {0x21fcfb8, 0xc000991b00}, {0x1cc51e0?, 0xc0001c1820?})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:320 +0x33c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005d85a0, {0x21fcfb8, 0xc000991b00})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:230 +0x325

You cannot delete the problematic Cluster object either, since the validating webhook is served by the same crashed Pod.

❯ kubectl delete cluster workload-1
Error from server (InternalError): Internal error occurred: failed calling webhook "validation.cluster.cluster.x-k8s.io": failed to call webhook: Post "https://capi-webhook-service.capi-system.svc:443/validate-cluster-x-k8s-io-v1beta1-cluster?timeout=10s": dial tcp 100.68.35.5:443: connect: connection refused

There is no way to recover the CAPI controller manager unless you delete CAPI's ValidatingWebhookConfiguration first.
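For example, assuming the default installation where the configuration is named capi-validating-webhook-configuration (check with kubectl get validatingwebhookconfigurations), the recovery could look roughly like this:

❯ kubectl delete validatingwebhookconfiguration capi-validating-webhook-configuration
❯ kubectl delete cluster workload-1

Once the problematic Cluster is gone the controller manager stops panicking; the deleted webhook configuration then needs to be restored, e.g. by re-applying the provider manifests.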

What did you expect to happen:
The CAPI controller manager Pod should tolerate errors from computing the topology, or guidance for recovery should be provided.

Anything else you would like to add:

Environment:

  • Cluster-api version: v1.2.8
  • minikube/kind version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Feb 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

Thx for the report.

I'll take a look.

@sbueringer
Member

sbueringer commented Feb 6, 2023

@DanielXiao I was unable to reproduce the issue. I added the variable and patch but it didn't panic.

Note: I also had to add the following to the KubeadmControlPlaneTemplate:

spec:
  template:
    spec:
      kubeadmConfigSpec:
        preKubeadmCommands: ["abc"]

Before that I just got the error that it's unable to find the key that it's supposed to replace (preKubeadmCommands).

What does your KubeadmControlPlaneTemplate look like at the moment the patch is applied? (To simplify this, you could move the patch to the beginning of the patches array.)

@sbueringer
Member

Okay, the panic occurs because, after one patch sets the array to nil, another patch tries to insert an element at the beginning of the (nil) array.
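For illustration, the failing combination boils down to an effective RFC 6902 patch sequence roughly like the sketch below (written as YAML; the first operation comes from the inline patch above, while the second stands in for whichever later patch inserts at the start of the same array, with a made-up placeholder value):

- op: replace
  path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands
  value: null               # the Go template renders to nothing when the variable is empty or omitted
- op: add
  path: /spec/template/spec/kubeadmConfigSpec/preKubeadmCommands/0
  value: some-command       # placeholder; inserting into the now-nil array triggers the panic in json-patch v5.6.0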

I've opened an issue in the json patch repo to fix this specific panic: evanphx/json-patch#171

@sbueringer
Member

Opened a PR to catch panics during applying patches: #8067
