Closed
48 changes: 48 additions & 0 deletions docs/troubleshooting.md
Expand Up @@ -60,3 +60,51 @@ As part of [CI](../scripts/ci-e2e.sh) there is a [log collection script](hack/..
```bash
./hack/log/log-dump.sh
```
## Examples of troubleshooting real-world issues

### Nodes did not come online

If one or more nodes did not join the cluster after you created a new cluster, or after you added a new machine or machinepool resource to an existing cluster, you can use the guidance above to SSH into the VM(s) and debug what happened. First, find out which VMs were created but failed to join the cluster by listing all VMs in the cluster resource group and comparing them to the nodes present in the cluster:

```
$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
$ export VM_PREFIX=my-cluster-md-0-
$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
$ for vm in $(az vm list -g "$CLUSTER_RESOURCE_GROUP" | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name'); do kubectl get node "$vm" >/dev/null 2>&1 || echo "node $vm did not join the cluster"; done
node my-cluster-md-0-8qlrg did not join the cluster
```

The above assumes that the nodes are backed by "machine" resources. If you're using "machinepool" resources, list the VMSS instances instead:

```
$ export VMSS_NAME=$(az vmss list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name')
$ for vm in $(az vmss list-instances -g "$CLUSTER_RESOURCE_GROUP" -n "$VMSS_NAME" | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name'); do kubectl get node "$vm" >/dev/null 2>&1 || echo "node $vm did not join the cluster"; done
```

(The above uses the `az` command line tool to talk to Azure, and the `jq` utility to parse JSON output. Use your preferred toolchain following the general pattern.)
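The general pattern is a set difference: names that exist as VMs but not as nodes. A minimal sketch of that comparison as a reusable shell function follows. The function name `missing_nodes` is hypothetical, and the commented usage assumes that `az`, `jq`-free JMESPath filtering via `--query`, and `kubectl` are available and configured as in the examples above. Fetching the node list once avoids one `kubectl` call per VM.

```bash
# missing_nodes VM_NAMES NODE_NAMES
# Prints every name from the first newline-separated list that does not
# appear as an exact line in the second newline-separated list.
# (Hypothetical helper, not part of any existing tooling.)
missing_nodes() {
  printf '%s\n' "$1" | while IFS= read -r vm; do
    # Skip blank lines so an empty VM list prints nothing.
    [ -n "$vm" ] || continue
    # -F: fixed string, -x: whole-line match, -q: quiet.
    printf '%s\n' "$2" | grep -Fxq -- "$vm" || printf '%s\n' "$vm"
  done
}

# Assumed usage against a live cluster:
#   vms=$(az vm list -g "$CLUSTER_RESOURCE_GROUP" \
#           --query "[?starts_with(name, '${VM_PREFIX}')].name" -o tsv)
#   nodes=$(kubectl get nodes \
#           -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
#   missing_nodes "$vms" "$nodes"
```

Any tool that can produce the two lists of names can feed this function, which keeps the comparison independent of the cloud CLI in use.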

The output above shows that the VM `my-cluster-md-0-8qlrg` is present in the resource group, but is not registered as a node in the cluster. Let's hop onto the VM and look around.

We'll assume you have SSH access to the control plane VM behind the apiserver [as described above](#remoting-to-workload-clusters). Add the SSH private key to your local SSH agent so that, with agent forwarding enabled, you can log in to any node from the control plane VM:

```
$ ssh-add -D
$ ssh-add ~/.ssh/my_private_key_rsa
$ ssh -A -i ~/.ssh/my_private_key_rsa capi@$(kubectl get azurecluster my-cluster -o jsonpath='{.status.network.apiServerIp.dnsName}')
capi@my-cluster-control-plane-68xfs:~$ ssh my-cluster-md-0-8qlrg
capi@my-cluster-md-0-8qlrg:~$
```

Now we're on the VM that didn't join the cluster. Let's look at the bootstrap logs on the VM for error data.

```
capi@my-cluster-md-0-8qlrg:~$ less /var/lib/waagent/custom-script/download/0/stdout
<inspect VM bootstrap script data>
capi@my-cluster-md-0-8qlrg:~$ journalctl -u cloud-final
<inspect cloud-final systemd logs>
capi@my-cluster-md-0-8qlrg:~$ less /var/log/cloud-init-output.log
<inspect cloud-init data>
capi@my-cluster-md-0-8qlrg:~$ journalctl -u kubelet
<inspect kubelet systemd logs>
```
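These logs can be long, so a first pass that filters for likely failure lines may help narrow down where bootstrapping went wrong. The sketch below is one way to do that. The function name `scan_log` and the search patterns are assumptions chosen for illustration, not an exhaustive list of the messages these components emit.

```bash
# scan_log FILE
# Prints numbered lines from FILE that match a few assumed failure patterns.
# Returns non-zero if the file is unreadable or nothing matches.
# (Hypothetical helper; adjust the patterns to the failure you are chasing.)
scan_log() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
  # -i: ignore case, -n: prefix line numbers, -E: extended regex.
  grep -inE 'error|fail|timed out|denied' "$1"
}

# Assumed usage on the VM:
#   scan_log /var/log/cloud-init-output.log
#   journalctl -u kubelet --no-pager > /tmp/kubelet.log && scan_log /tmp/kubelet.log
```

A match only tells you where to start reading; the surrounding lines in the full log usually carry the actual cause.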