diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 6d802073969..11153b596ee 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -60,3 +60,51 @@ As part of [CI](../scripts/ci-e2e.sh) there is a [log collection script](hack/..
 ```bash
 ./hack/log/log-dump.sh
 ```
+## Examples of troubleshooting real-world issues
+
+### Nodes did not come online
+
+If one or more nodes did not join the cluster after you created a new cluster, or after you added a new machine or machinepool resource to an existing cluster, you can use the guidance above to SSH into the VM(s) and debug what happened. First, find out which VMs were created but failed to join the cluster by listing all VMs in the cluster resource group and comparing them to the nodes present in the cluster:
+
+```
+$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
+$ export VM_PREFIX=my-cluster-md-0-
+$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
+$ for vm in $(az vm list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name'); do kubectl get node $vm 2>&1 >/dev/null && continue || echo node $vm did not join the cluster; done
+Error from server (NotFound): nodes "my-cluster-md-0-8qlrg" not found
+node my-cluster-md-0-8qlrg did not join the cluster
+```
+
+The above assumes the nodes are backed by "machine" resources. If you're using "machinepool" resources:
+
+```
+$ export VMSS_NAME=$(az vmss list -g $CLUSTER_RESOURCE_GROUP | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name')
+$ for vm in $(az vmss list-instances -g $CLUSTER_RESOURCE_GROUP -n $VMSS_NAME | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name'); do kubectl get node $vm 2>&1 >/dev/null && continue || echo node $vm did not join the cluster; done
+```
+
+(The above uses the `az` command line tool to talk to Azure, and the `jq` utility to parse JSON output. Use your preferred toolchain, following the same general pattern.)
+
+The output above shows that the VM `my-cluster-md-0-8qlrg` is present in the resource group, but is not a node in the cluster. Let's hop onto the VM and look around.
+
+We'll assume we have SSH access to the control plane VM behind the apiserver [as described above](#Remoting-to-workload-clusters). Add the SSH private key to your local SSH agent so that you can log into any node from the control plane VM:
+
+```
+$ ssh-add -D
+$ ssh-add ~/.ssh/my_private_key_rsa
+$ ssh -A -i ~/.ssh/my_private_key_rsa capi@$(kubectl get azurecluster my-cluster -o jsonpath='{.status.network.apiServerIp.dnsName}')
+capi@my-cluster-control-plane-68xfs:~$ ssh my-cluster-md-0-8qlrg
+capi@my-cluster-md-0-8qlrg:~$
+```
+
+Now we're on the VM that didn't join the cluster. Let's look at the bootstrap logs on the VM for errors.
+
+```
+capi@my-cluster-md-0-8qlrg:~$ less /var/lib/waagent/custom-script/download/0/stdout
+
+capi@my-cluster-md-0-8qlrg:~$ journalctl -u cloud-final
+
+capi@my-cluster-md-0-8qlrg:~$ less /var/log/cloud-init-output.log
+
+capi@my-cluster-md-0-8qlrg:~$ journalctl -u kubelet
+
+```