Cannot add agent node with a previously used hostname (that was deleted) #6054

Closed
binaryn3xus opened this issue Aug 30, 2022 · 10 comments

@binaryn3xus

Environmental Info:
K3s Version: v1.24.3+k3s1

Node(s) CPU architecture, OS, and Version: Linux fleetcom-node4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 3 servers, 1 agent (trying to add back one more agent)

Describe the bug:
I am trying to add back a node that I removed in order to upgrade its hardware, using the same hostname as before. It got a brand-new hard drive and more RAM (if that matters). I reinstalled Ubuntu 22.04, the same version it was running before, and then tried to join it back to the cluster. While running the install script, it hangs at [INFO] systemd: Starting k3s-agent. Running systemctl status k3s-agent, I can see this log line being repeated:

Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"
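For reference, this is roughly how I am watching the agent while it retries (unit name k3s-agent as created by the install script):

> sudo systemctl status k3s-agent
> sudo journalctl -u k3s-agent -f    # follow the registration retries live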

Steps To Reproduce:

  1. Get server token (using sudo cat /var/lib/rancher/k3s/server/node-token on master node)
  2. Run command: curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" sh -
  3. The script runs and then hangs at systemd: Starting k3s-agent, as described in the 'Describe the bug' section above.

Expected behavior:

I am trying to add this node back with the same hostname it had before I removed it.
I would expect it to rejoin without issue, since I drained the node and deleted it from the cluster before swapping out the SSD.

Actual behavior:

Getting this log message repeated in the k3s-agent journal:

Aug 30 14:00:20 fleetcom-node4 k3s[1697]: time="2022-08-30T14:00:20Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password' may not match server node-passwd entry, try enabling a unique node name with the --with-node-id flag"

Additional context / logs:

Before changing out the hardware, I ran a drain command and then kubectl delete fleetcom-node4.
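From memory, the removal sequence looked roughly like this (the exact drain flags may have differed):

> kubectl drain fleetcom-node4 --ignore-daemonsets --delete-emptydir-data
> kubectl delete node fleetcom-node4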

@brandond (Member)

Can you confirm that the fleetcom-node4 node was successfully deleted from the cluster? The error message you're receiving indicates that it is still there.
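A quick way to double-check from one of the servers (hostname taken from your report):

> kubectl get node fleetcom-node4

That should return a NotFound error if the node object is really gone.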

@binaryn3xus (Author) commented Aug 30, 2022

Can you confirm that the fleetcom-node4 node was successfully deleted from the cluster? The error message you're receiving indicates that it is still there.

> kubectl get nodes
NAME             STATUS     ROLES                       AGE    VERSION
fleetcom-node1   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node2   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node3   Ready      control-plane,etcd,master   21d    v1.24.3+k3s1
fleetcom-node5   NotReady   <none>                      2d1h   v1.24.3+k3s1

(Node5 is a VM that I have powered off at the moment)

@brandond (Member)

Can you try kubectl get secret -n kube-system fleetcom-node4.node-password.k3s ? This secret should have been deleted when you removed the node from the cluster; if it still remains then delete it and try registering the node again.
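If the secret does still exist, cleanup would look something like this (hostname as above), followed by a restart of the agent on the joining node:

> kubectl delete secret -n kube-system fleetcom-node4.node-password.k3s
> sudo systemctl restart k3s-agent    # on fleetcom-node4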

@binaryn3xus (Author)

> kubectl get secret -n kube-system fleetcom-node4.node-password.k3s
Error from server (NotFound): secrets "fleetcom-node4.node-password.k3s" not found

> kubectl get secret -n kube-system
NAME                               TYPE                DATA   AGE
fleetcom-node1.node-password.k3s   Opaque              1      21d
fleetcom-node2.node-password.k3s   Opaque              1      21d
fleetcom-node3.node-password.k3s   Opaque              1      21d
fleetcom-node5.node-password.k3s   Opaque              1      2d1h
k3s-serving                        kubernetes.io/tls   2      21d

@brandond (Member)

And the node still will not successfully join the cluster? Do you see anything in the logs on the servers?
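If it helps narrow things down, grepping the server journals for the hostname or the password check might surface the rejection (assuming the servers run the default k3s unit):

> sudo journalctl -u k3s | grep -iE 'fleetcom-node4|node-passwd'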

@binaryn3xus (Author)

Correct. I have looked for anything that might be helpful, but maybe I am looking in the wrong places. Suggestions? So far I have found no clues as to why I can't add it.

@brandond (Member) commented Aug 31, 2022

Can you attach the full logs from all 4 nodes? The 3 active servers and the node you're trying to join?
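Something roughly like this would capture them (unit names assuming a default install: k3s on the servers, k3s-agent on the joining node):

> sudo journalctl -u k3s --no-pager > k3s-$(hostname).log              # on each server
> sudo journalctl -u k3s-agent --no-pager > k3s-agent-$(hostname).log  # on the joining node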

@binaryn3xus (Author) commented Aug 31, 2022

The files were large, so you can get them from my Google Drive - k3s-logs.

I had to grab the systemd service logs, since there were no other logs for the node I was trying to add.

@brandond (Member)

I wasn't able to find anything useful in the logs. In particular, I noticed that the logs on the servers stop at August 29th, while the only logs from the node that's failing to join are from the 31st. Makes it kind of hard to correlate anything.

If you're still fighting this, have you tried removing /etc/rancher/node/password from the node before rejoining it?
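That would look roughly like this on the joining node (the path is the one from the error message):

> sudo systemctl stop k3s-agent
> sudo rm /etc/rancher/node/password
> sudo systemctl start k3s-agent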

@binaryn3xus (Author)

@brandond I did try that. Honestly, though, this gave me the kick in the pants I needed to set up something like Flux. I have also decided to go the route of appending the node ID to the node name, which should help minimize issues like this one.
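For the record, my plan is roughly the original install command with the agent's --with-node-id flag appended, which (as I understand it) makes k3s add a unique suffix to the node name:

> curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.24.3+k3s1" K3S_URL=https://10.5.0.3:6443 K3S_TOKEN="<NODE-TOKEN>" sh -s - --with-node-id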

So this issue can be closed unless it is something the project wants to continue to investigate.

@brandond closed this as completed Jan 7, 2023