docs: add troubleshooting examples debugging missing nodes #831
jackfrancis wants to merge 3 commits into
@jackfrancis: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
Hi @jackfrancis. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
[APPROVALNOTIFIER] This PR is NOT APPROVED
/ok-to-test
$ export CLUSTER_RESOURCE_GROUP=my-cluster-rg
$ export VM_PREFIX=my-cluster-md-0-
$ export KUBECONFIG=/Users/me/.kube/my-cluster.kubeconfig
$ for vm in $(az vm list -g "$CLUSTER_RESOURCE_GROUP" | jq -r --arg VM_PREFIX "${VM_PREFIX}" '.[] | select (.name | startswith($VM_PREFIX)).name'); do kubectl get node "$vm" >/dev/null 2>&1 || echo "node $vm did not join the cluster"; done
this only works for VMs, not vmss (ie. not MachinePool) right?
Right. I'll add an equivalent example for VMSS.
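A rough VMSS counterpart to the loop above might look like the following. This is only a sketch: the function name `find_unregistered_vmss_instances` is made up for illustration, and it assumes that each instance's `osProfile.computerName` matches the Kubernetes node name, which should be checked against a real cluster. It uses the Azure CLI's built-in `--query` option rather than jq.

```shell
# Hypothetical helper: print VMSS instances that have no matching Kubernetes node.
# Assumes `az` and `kubectl` are installed and KUBECONFIG points at the workload cluster.
# Assumption: the node name equals the instance's osProfile.computerName.
find_unregistered_vmss_instances() {
  resource_group="$1"
  vmss_name="$2"
  az vmss list-instances -g "$resource_group" -n "$vmss_name" \
      --query "[].osProfile.computerName" -o tsv |
  while read -r instance; do
    # Skip blank lines in the CLI output.
    [ -n "$instance" ] || continue
    # A failed lookup means no node with this name has registered.
    if ! kubectl get node "$instance" </dev/null >/dev/null 2>&1; then
      echo "instance $instance did not join the cluster"
    fi
  done
}
```

Usage would be along the lines of `find_unregistered_vmss_instances "$CLUSTER_RESOURCE_GROUP" my-cluster-mp-0`, where the scale set name is a placeholder.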
Wouldn't kubectl get machines / kubectl get machinepool | grep -v Ready have the same effect without needing the az CLI and calling out to Azure? In your case, did the machine show as Ready?
Sorry, grep -v Running *
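As a sketch of this suggestion, the filter could be wrapped as below. It assumes KUBECONFIG points at the management cluster, where the Cluster API objects live, and it compares the phase column exactly instead of using `grep -v`, so a machine whose name happens to contain "Running" is not hidden. The function name `list_not_running` is illustrative only.

```shell
# Hypothetical helper: list Cluster API objects whose phase is not Running.
# Pass the resource kind (for example machines or machinepools); defaults to machines.
list_not_running() {
  kind="${1:-machines}"
  kubectl get "$kind" --no-headers \
      -o custom-columns=NAME:.metadata.name,PHASE:.status.phase |
    awk '$2 != "Running" { print $1 " is in phase " $2 }'
}
```

For example, `list_not_running machines` or `list_not_running machinepools`.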
I honestly didn't look at the machine resource at all. This was how my brain worked:
- create cluster w/ desired node count
- wait for nodes to come online, after a while noticed that there was one missing node
- how many actual VMs are in my resource group? it's 20
- O.K., so which one didn't register as a node?
FWIW
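The steps above can be sketched as a quick count comparison. This is a hedged illustration, not part of the PR: the function name `report_vm_node_gap` is hypothetical, and it assumes every VM in the resource group is expected to become a node (a bastion or other extra VM would skew the count).

```shell
# Hypothetical helper: compare the number of VMs in a resource group with
# the number of nodes registered in the cluster.
# Assumes `az` and `kubectl` are installed and KUBECONFIG points at the workload cluster.
report_vm_node_gap() {
  resource_group="$1"
  # JMESPath length(@) returns the number of VMs in the list.
  vm_count=$(az vm list -g "$resource_group" --query "length(@)" -o tsv)
  # One line per node once headers are suppressed.
  node_count=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')
  echo "VMs: $vm_count, nodes: $node_count, missing: $((vm_count - node_count))"
}
```

A non-zero "missing" value is the cue to find out which VM did not register, for example with the loop from the diff.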
Right, I think both are valid, you went to the RG as a first instinct because you're familiar with Azure. For users who aren't as comfortable with Azure specific stuff it'd be nice to document how all this stuff can be done without needing to care about the underlying infrastructure... kubectl get azuremachine should show you the VM status while kubectl get machine should show you the status of the machine from a k8s perspective.
I think both perspectives are valid here. In reality, you will probably need to look at both the CAPZ CRDs and the underlying infra. At least that's how I have approached it so far: I look at the CRDs for a quick understanding of what CAPZ thinks is going on, then I look at the Azure system to see what is happening. It would be worth adding a small section showing how the VMs in Azure relate to the CAPZ CRDs.
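One possible way to show how the CRDs relate to the Azure VMs is to print each AzureMachine next to the VM name taken from its provider ID. This is a sketch under assumptions: the function name is hypothetical, it relies on the AzureMachine fields `.status.ready`, `.status.vmState`, and `.spec.providerID`, and it assumes the provider ID ends in the Azure VM name. Verify both against the CAPZ version in use.

```shell
# Hypothetical helper: for each AzureMachine, print its name, readiness,
# VM provisioning state, and the Azure VM name derived from the provider ID.
# Assumes KUBECONFIG points at the management cluster.
show_azuremachine_vms() {
  kubectl get azuremachines --no-headers \
      -o custom-columns=NAME:.metadata.name,READY:.status.ready,VMSTATE:.status.vmState,PROVIDERID:.spec.providerID |
    awk '{
      # Assumption: the last path segment of the provider ID is the VM name.
      n = split($4, parts, "/")
      print $1, $2, $3, parts[n]
    }'
}
```

The VM name in the last column can then be looked up in the resource group with the Azure CLI or the portal.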
/assign @mboersma @jsturtevant for additional review
Love the idea of giving use-case-driven debugging tips. My thoughts are if you use the CAPZ CRDs you could get rid of the I use the ssh and map tool all the time when debugging and would find something similar useful too. I had this problem of nodes coming online just yesterday 😄
@jackfrancis once #901 merges, consider changing these instructions to look at boot diagnostics from the portal or the Azure CLI (https://docs.microsoft.com/en-us/cli/azure/vm/boot-diagnostics?view=azure-cli-latest#az-vm-boot-diagnostics-get-boot-log)
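For the Azure CLI route, a minimal sketch might look like this. It assumes boot diagnostics are enabled on the VM; the function name `tail_boot_log` and the default of 50 lines are illustrative choices, not something taken from the PR.

```shell
# Hypothetical helper: print the last lines of a VM's boot log.
# Assumes `az` is installed and logged in, and that boot diagnostics
# are enabled on the VM.
tail_boot_log() {
  resource_group="$1"
  vm_name="$2"
  lines="${3:-50}"
  az vm boot-diagnostics get-boot-log -g "$resource_group" -n "$vm_name" |
    tail -n "$lines"
}
```

For example, `tail_boot_log "$CLUSTER_RESOURCE_GROUP" my-cluster-md-0-abcde 100`, where the VM name is a placeholder for the instance that failed to join.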
@jackfrancis are you still planning on getting this one in?
@jackfrancis: The following test failed, say
Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
@jackfrancis gentle nudge on this. We are happy to take over from current state if you would prefer.
Rewrote this with new options in #1232
/close
@CecileRobertMichon: Closed this PR.
Signed-off-by: Burak Ok <burakok@microsoft.com>
What this PR does / why we need it:
This PR adds some supporting troubleshooting documentation to get a user started debugging why a cluster node did not come online.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: