
[cabpk] Write "/etc/hosts" and "/etc/hostname" #1580

Closed
akutz opened this issue Aug 12, 2019 · 24 comments

Labels
kind/feature Categorizes issue or PR as related to a new feature.
priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@akutz
Contributor

akutz commented Aug 12, 2019

/kind feature

Describe the solution you'd like
The bootstrap data should include the files /etc/hosts and /etc/hostname with the values derived from the host. This ensures these files reflect the correct values without requiring a reboot before the systemd-hostnamed service updates them.
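For illustration, a minimal cloud-config sketch of what such bootstrap data could look like; the name node-1 and the address 10.0.0.10 are hypothetical placeholders, not values from this issue:

#cloud-config
write_files:
- path: /etc/hostname
  permissions: "0644"
  content: |
    node-1
- path: /etc/hosts
  permissions: "0644"
  content: |
    127.0.0.1   localhost
    10.0.0.10   node-1.example.com node-1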

@akutz
Contributor Author

akutz commented Aug 12, 2019

@detiber
Member

detiber commented Aug 12, 2019

@akutz I'm not sure I agree that the bootstrap data should always include /etc/hosts and /etc/hostname. Shouldn't we rely on the set_hostname, update_hostname, and update_etc_hosts modules to handle this?
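For reference, those cloud-init modules are typically driven by cloud-config keys like the following; this is only a sketch with placeholder names, not a confirmed fix:

#cloud-config
hostname: node-1                 # consumed by set_hostname/update_hostname
fqdn: node-1.example.com
preserve_hostname: false         # allow cloud-init to keep the hostname in sync
manage_etc_hosts: true           # let update_etc_hosts render /etc/hosts from a template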

@vincepri
Member

In addition to what @detiber mentioned, this functionality should be offered as opt-in behavior that only some providers would use. AWS, for example, already sets these values internally on boot and doesn't need further intervention.

@akutz
Contributor Author

akutz commented Aug 13, 2019

Hi @detiber / @vincepri,

Some background on this:

  • The kubelet in CAPV images was using the static host name instead of the transient (DHCP) host name
  • This is because the return value from the system call gethostname() is not always in sync with the value in /etc/hostname
  • This drift occurs when the /etc/hostname file has not yet been updated by systemd-hostnamed, a service that starts after cloud-init
  • The result was sudo calls that take obscenely long and kubelet node resolution that failed due to the value of /etc/hostname and /etc/hosts not reflecting the current, transient host name

All this was a problem on CAPV because the host name provided to a guest may be affected by a DHCP server's advertised domain, causing the static host name to change to the transient, fully qualified host name. This would be a problem on any cloud-init system that uses DHCP and is subject to transient host names.
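A quick way to see the drift described above on an affected guest (a sketch; example names only):

hostname                      # kernel name returned by gethostname(), e.g. node-1.example.com
cat /etc/hostname             # may still hold the stale static name, e.g. node-1
getent hosts "$(hostname)"    # empty if /etc/hosts was never updated, which is what slows down sudo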

However

It turns out all of this might have been avoided in the first place if the bug resolved by kubernetes-sigs/cluster-api-provider-vsphere#476 had not existed. The cloud provider was never actually resolving VMs via their UUIDs, so node resolution wasn't occurring against vSphere. This was a really subtle issue, and quite frankly, I'm as shocked that the bug wasn't caught sooner as I am that it was caught at all.

@ykakarap is building a custom version of CAPV that reverts kubernetes-sigs/cluster-api-provider-vsphere#474. Since CAPV now includes the cloud-provider fix (kubernetes-sigs/cluster-api-provider-vsphere#476), we think the host name fix in kubernetes-sigs/cluster-api-provider-vsphere#474 will no longer be necessary, since the kubelet will be able to resolve the node by its UUID.

I will close this issue if @ykakarap's testing reveals the issue no longer occurs.

@ykakarap
Contributor

Test results:

By using a CAPV image that reverts kubernetes-sigs/cluster-api-provider-vsphere#474, clusterctl is NOT able to create a management cluster on vSphere.
A VM is created on vSphere, but it looks like there is an issue with the kubelet on that VM.

Relevant kubelet logs:

Unable to register node "sc-rdops-vm06-dhcpv6-16d0" with API server: nodes "sc-rdops-vm06-dhcpv6-16d0" is forbidden: node "sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com" cannot modify node "sc-rdops-vm06-dhcpv6-16d0"

Relevant section of /var/run/cloud-init/instance-data.json:

"ds": {
  "meta_data": {
   "hostname": "sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com",
   "instance-id": "management-cluster-controlplane-1",
   "local-hostname": "sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com",
   "local-ipv4": "10.160.169.10",
    ...

Relevant section of /tmp/kubeadm.yaml:

nodeRegistration:
  criSocket: /var/run/containerd/containerd.sock
  kubeletExtraArgs:
    cloud-provider: vsphere
  name: 'sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com'

Content of /var/lib/cloud/data/set-hostname:

{
 "fqdn": "sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com",
 "hostname": "sc-rdops-vm06-dhcpv6-16d0"
}

Results of hostname and hostnamectl commands:

ubuntu@sc-rdops-vm06-dhcpv6-16d0:~$ hostname
sc-rdops-vm06-dhcpv6-16d0
ubuntu@sc-rdops-vm06-dhcpv6-16d0:~$ hostnamectl
   Static hostname: sc-rdops-vm06-dhcpv6-16d0
   Pretty hostname: localhost.local
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 19e6d06bff3744ce8fd7713ca9420d7f
           Boot ID: 3bc8c93b690b44858a7fa6e3e960134a
    Virtualization: vmware
  Operating System: Ubuntu 18.04.2 LTS
            Kernel: Linux 4.15.0-54-generic
      Architecture: x86-64

@detiber
Member

detiber commented Aug 15, 2019

Hmm, that output leads me to believe that the hostname is initially set to the FQDN (or possibly the ds.meta_data.local-hostname value) when kubelet bootstrapping initially starts, and at some point later changes to the hostname defined in /var/lib/cloud/data/set-hostname.

@akutz @ykakarap are there possibly two processes trying to set the local hostname based on those different values?

@akutz
Contributor Author

akutz commented Aug 15, 2019

Hi @detiber,

It's absolutely the case that ds.meta_data.local-hostname is modified once the network is online. Initially it would just be the host name sans domain, but once the networking is online, the network info is discovered again and then persisted back to instance-data.json to ensure the local IPv4 and IPv6 addresses are included.
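One way to watch that value change on the node (a sketch; assumes jq is installed):

sudo jq '.ds.meta_data["local-hostname"], .ds.meta_data["local-ipv4"]' /var/run/cloud-init/instance-data.json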

@akutz
Contributor Author

akutz commented Aug 15, 2019

What's interesting to me is that the instance ID is management-cluster-controlplane-1. That's also the machine name, and thus the host name. The DHCP server is completely overriding the entire host name with a new host FQDN instead of just appending an advertised domain. In other words:

The host name changes from

management-cluster-controlplane-1

to

sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com

It was never just sc-rdops-vm06-dhcpv6-16d0.

@detiber
Member

detiber commented Aug 15, 2019

@akutz based on the output of the kubelet logs, it looks like it is indeed changed to sc-rdops-vm06-dhcpv6-16d0, which is also the value returned from hostname and hostnamectl as indicated in the output above.

That appears to be happening after the initial change from management-cluster-controlplane-1 to sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com, though.

@akutz
Contributor Author

akutz commented Aug 15, 2019

Hi @detiber,

I'm sorry, I meant it wasn't ever just sc-rdops-vm06-dhcpv6-16d0 before knowing about the FQDN.

The way the host name commands work and are invoked in the cloud-init sources shows how the host name is set in multiple ways: https://github.com/cloud-init/cloud-init/blob/2f3bb764626b9065f4102c7c0a67998a9c174444/cloudinit/distros/__init__.py#L218-L347.

What's happening is that the hostname is set to sc-rdops-vm06-dhcpv6-16d0.eng.vmware.com, and then a call to gethostname is used and just sc-rdops-vm06-dhcpv6-16d0 is returned, because the call defaults to the short form instead of the long one.
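For reference, the short vs. long distinction can be seen with the hostname utility (a sketch using the values from the output above):

hostname        # what gethostname() returns, here the short name: sc-rdops-vm06-dhcpv6-16d0
hostname -s     # explicitly the short form
hostname -f     # the long form (FQDN), resolved via /etc/hosts or DNS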

@akutz
Contributor Author

akutz commented Aug 15, 2019

I think what we actually need to do is update the vSphere cloud-init datasource here to ask cloud-init to update the hostname after DHCP is enabled to ensure the host files are updated before kubeadm is run. This ensures that calls to gethostname return the correct values because they will no longer be out of sync with the host files on disk.
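As a rough illustration of that end state (not the actual datasource change), the relevant modules already exist and run from /etc/cloud/cloud.cfg; the open question is getting them to run again with the DHCP-derived name before kubeadm starts:

# /etc/cloud/cloud.cfg excerpt (sketch; the module names are real cloud-init modules)
preserve_hostname: false
cloud_init_modules:
  - set_hostname
  - update_hostname
  - update_etc_hosts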

@akutz
Contributor Author

akutz commented Aug 15, 2019

Hmm, that or do not let this line overwrite the hostname that was provided via the metadata.

(which @ykakarap already told me we should do a long time ago)

@ykakarap
Contributor

Hmm, that or do not let this line overwrite the hostname that was provided via the metadata.

This will make sure that the hostname of the VM is always the machine name (like management-cluster-controlplane-1).
(FYI, I already tried this fix and it works)

Or, we can make sure to update the hostname properly before kubeadm runs as suggested here

update the vSphere cloud-init datasource here to ask cloud-init to update the hostname after DHCP is enabled

I think the choice boils down to whether we want the machine's hostname to be the machine name or the FQDN.

I would vote for the hostname being set to the machine name as it would be consistent with the behavior in environments like VMC.

@akutz
Contributor Author

akutz commented Aug 15, 2019

Hi @ykakarap,

I would vote for the hostname being set to the machine name

The problem with this, and correct me if I'm wrong @detiber, is that when a DHCP server sets the host name option, the client (machine), unless configured not to, will use that name. This is what is happening in the Nimbus situation.

it would be consistent with the behavior in environments like VMC.

VMC's DHCP server simply doesn't set the host name option, so it's not a VMC specific issue. It's specific to DHCP.

This is once again a problem unique to the on-prem providers: the AWS instance data sets the host name to the actual name the host will have, since AWS knows that in advance and includes it in the metadata for the AWS cloud-init datasource to persist as the instance data.

Whereas in a true DHCP environment, CAPV won't know the host name of the machine in advance.
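For completeness, a DHCP client can usually be told not to adopt the server-advertised name; a sketch for isc-dhclient (/etc/dhcp/dhclient.conf), assuming that is the client in use:

# Keep the locally configured name instead of the one the DHCP server advertises.
supersede host-name "management-cluster-controlplane-1";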

@detiber
Member

detiber commented Aug 15, 2019

I'm not sure how cloud provider integration works with CAPV, but in at least AWS and OpenStack we are limited in what the Node "Name" can be based on what is expected by the cloud provider integration to match with queries against the remote API.

@akutz
Contributor Author

akutz commented Aug 15, 2019

I'm not sure how cloud provider integration works with CAPV, but in at least AWS and OpenStack we are limited in what the Node "Name" can be based on what is expected by the cloud provider integration to match with queries against the remote API.

That's not the issue here. It's that the DHCP server is overriding the host name.

@chuckha
Contributor

chuckha commented Oct 10, 2019

@akutz can we close this?

@ncdc ncdc changed the title Write "/etc/hosts" and "/etc/hostname" [cabpk] Write "/etc/hosts" and "/etc/hostname" Oct 16, 2019
@ncdc ncdc transferred this issue from kubernetes-retired/cluster-api-bootstrap-provider-kubeadm Oct 16, 2019
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 16, 2019
@chuckha
Contributor

chuckha commented Oct 24, 2019

Trying to figure this one out, but this sentence has me a bit stumped:

The bootstrap data should include the files /etc/hosts and /etc/hostname with the values derived from the host.

It sounds like the files /etc/hosts and /etc/hostname need to be correctly populated with data only known by the host. If there is an external thing that can provide the data these files need, then perhaps the thing responsible for making the bootstrap data can get the necessary data from that external system.

However, if only the host knows what values need to be there, then either the host needs to exist before CABPK generates data and the user adds these files to the KubeadmConfigSpec, or the user could inject a script to generate these files before kubeadm init runs.
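A sketch of both of those user-side options using KubeadmConfig fields (all values are placeholders):

kind: KubeadmConfig    # CABPK bootstrap config (sketch)
spec:
  # Option 1: the user adds the files directly (only possible if the values are known up front).
  files:
  - path: /etc/hostname
    permissions: "0644"
    content: |
      node-1
  # Option 2: inject a command that generates the files on the host before kubeadm init runs.
  preKubeadmCommands:
  - bash -c 'echo "127.0.1.1 $(hostname)" >> /etc/hosts'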

Not really sure where to put this, so I'm leaving it at low priority in the current milestone. Please adjust if necessary; this is a best guess.

/milestone v0.3.0
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 24, 2019
@k8s-ci-robot k8s-ci-robot added this to the v0.3.0 milestone Oct 24, 2019
@akutz
Contributor Author

akutz commented Dec 6, 2019

cc @yastij

We need to kick this out of the v1a3 milestone if this is no longer an issue. Yuvaraj had previously been working on this.

@yastij
Member

yastij commented Dec 9, 2019

/assign

@joonas

joonas commented Dec 20, 2019

@yastij did you have any updates on this?

@vincepri
Member

@yastij Any updates?

Bumping to next for now.

/milestone Next
/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 12, 2020
@k8s-ci-robot k8s-ci-robot modified the milestones: v0.3.0, Next Feb 12, 2020
@vincepri vincepri removed the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Feb 12, 2020
@vincepri
Member

Closing this due to lack of updates

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing this due to lack of updates

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
