Update hostname when transient hostname is fedora #2160

bdurrow · 2020-10-14T13:53:33Z

Rebase of #2158, A recent change, perhaps the adoption of fcos 33 broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the dhcp provided hostname in this case.

- What I did
Modified templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml so that the string fedora is treated like localhost when we check to see if the hostname was previously set.

- How to verify it
Build a 4.6 cluster in GCP.

- Description for the changelog

Use dhcp provided hostname when transient hostname is fedora

A recent change, perhaps the adoption of fcos 33 broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the dhcp provided hostname in this case.

openshift-ci-robot · 2020-10-14T13:53:57Z

Hi @bdurrow. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vrutkovs · 2020-10-14T14:01:07Z

/ok-to-test

This indeed fixes OKD 4.6 install with F33 content on GCP (thanks!), however I'd prefer to find out why default hostname is now fedora instead of localhost. @dustymabe do you know of any change which could have caused this?

bdurrow · 2020-10-14T14:08:48Z

I agree this is not the best fix but is one that provides a workaround that I know works and I can apply as a MachineConfig patch. I think we all agree that set-valid-hostname.sh is not the right solution to the problem is solves and is in itself a workaround. In my particular environment hostnames are 52 characters and I would rather use the default behavior of NetworkManager but it is harder to accomplish that than offer this workaround. That said, I agree it is preferable to change the behavior that caused the need for this patch.

openshift-ci-robot · 2020-10-14T15:36:36Z

@bdurrow: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-metal-ipi	`1e02d9f`	link	`/test e2e-metal-ipi`
ci/prow/e2e-aws	`1e02d9f`	link	`/test e2e-aws`
ci/prow/e2e-upgrade	`1e02d9f`	link	`/test e2e-upgrade`
ci/prow/okd-e2e-aws	`1e02d9f`	link	`/test okd-e2e-aws`
ci/prow/e2e-ovn-step-registry	`1e02d9f`	link	`/test e2e-ovn-step-registry`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

vrutkovs · 2020-10-14T16:15:11Z

/retest
Flakes

vrutkovs · 2020-10-15T11:20:17Z

Flakes

/retest

dustymabe · 2020-10-16T13:49:45Z

This indeed fixes OKD 4.6 install with F33 content on GCP (thanks!), however I'd prefer to find out why default hostname is now fedora instead of localhost. @dustymabe do you know of any change which could have caused this?

I don't know what could be causing that. Does it only happen on GCP? Can you reproduce with just FCOS by itself (no OKD)? If you can maybe open an issue over at https://github.com/coreos/fedora-coreos-tracker.

vrutkovs · 2020-10-19T15:48:20Z

/retest

vrutkovs · 2020-10-20T10:15:50Z

Right, this happens on plain FCOS as well. Default hostname on AWS is localhost though

bdurrow · 2020-10-20T14:11:51Z

perhaps related? https://github.com/peckato1/fedora-systemd-package/commit/6eb8bcde288dda39b163e87ee0926f6f30fcad73

More discussion at coreos/fedora-coreos-tracker#649

darkmuggle · 2020-10-22T14:52:46Z

In general, I think that this code-path is a hack already (and yeah, I started it). Dealing with localhost is one thing, but I am leary about adding special cases. fedora could be either a default hostname or one configured by the cloud. localhost is universally agreed to mean the location for the loopback address (i.e. 127.0.0.0/8, ::1, etc).

vrutkovs · 2020-10-22T15:04:03Z

I'd certainly prefer keeping localhost, but can it be done effortlessly in FCOS?

darkmuggle · 2020-10-22T16:30:25Z

I'd certainly prefer keeping localhost, but can it be done effortlessly in FCOS?

I don't think that "effortlessly" is the correct metric. Fixing this issue in the best place is the better solution. For FCOS, ensuring that the hostname is set via Afterburn is an option that could be explored. And while FCOS having "fedora" as the hostname is not a problem per se, if this is happening for all GCP nodes with long names, I would submit we should fix FCOS.

Note: the reason why RHCOS can't be fixed the same way as FCOS is due to RHCOS not running Afterburn on GCP. And since the RHCOS boot images are not updated (i.e. Afterburn/first boot would have already run) the MCO was the only reasonable place to address this bug. For FCOS, the story is very different.

vrutkovs · 2020-10-22T16:38:08Z

The issue happens on any GCP nodes, it affects F33-based machines due to systemd-resolved introduction. Not sure if Afterburn is supposed to fix the hostname after -resolved (but I don't mind really).

Lets keep this on
/hold

and continue in coreos/fedora-coreos-tracker#649

bdurrow · 2020-10-22T16:40:28Z

And while FCOS having "fedora" as the hostname is not a problem per se, if this is happening for all GCP nodes with long names, I would submit we should fix FCOS.

This happens for all gcp fcos nodes, not just ones ones with long names.

cgwalters · 2020-10-22T16:46:11Z

Probably the simplest thing is to check not for localhost but whether the hostname resolves to 127.0.0.1/::1.

yuqi-zhang · 2020-10-22T16:48:01Z

Thanks for the report! And in this case I agree with @darkmuggle , I think the fix should get into FCOS and not the MCO. I'm not very comfortable expanding the hack to include more specific strings like this.

cgwalters · 2020-10-22T19:18:03Z

The canonical Fedora change looks like https://src.fedoraproject.org/rpms/systemd/c/6eb8bcde288dda39b163e87ee0926f6f30fcad73
(It's not FCOS specific)

bdurrow · 2020-10-26T16:57:26Z

Thanks @lucab, I understand the afterburn issue as you describe it. Hopefully the the logic that we should set the static hostname to the node name to provide stability is sound. If it is, one possible upgrade solution would be to commit that change to the 4.5 branch and writing the upgrade graph such that an upgrade is required to that z stream version before upgrading to a version that changes the hostname logic.

darkmuggle

Formalizing my objections in a review.

openshift-ci-robot · 2020-10-27T14:30:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bdurrow
To complete the pull request process, please assign yuqi-zhang after the PR has been reviewed.
You can assign the PR to them by writing /assign @yuqi-zhang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

darkmuggle · 2020-10-27T14:30:53Z

To be clear, I'm only arguing for Afterburn to apply the GCP hostname -- that's the platform that's giving us fits.

If we want the hostname to be a stable valid cluster name I think the solution is to set the static hostname instead of the transient one.

@bdurrow that's a heavy hammer and one that I strongly oppose -- truncating the hostname is bad enough. Binding a dynamically assigned hostname could result in multiple nodes having the same truncated name much less the debug pain of seeing foo.example.com and then trying to locate it on the network. At least with a truncated hostname -- and DHCP renewals - -- we don't get a hostname collision and a network admin can see that truncated hostname A really host A.

bdurrow · 2020-10-28T03:40:55Z

The question assumes that NM will only run the dispatcher once. But when you have an OVS VIF flapping, then the hostname could change. And since we need the hostname to stay that same for the life of the cluster or at least the boot....we have a guard here. The purpose of this specific check/hook is to provide a valid CLUSTER name, not to ensure a valid Linux hostname. Put another way, you can have a valid Linux hostname that's invalid for the cluster.

@bdurrow that's a heavy hammer and one that I strongly oppose -- truncating the hostname is bad enough. Binding a dynamically assigned hostname could result in multiple nodes having the same truncated name much less the debug pain of seeing foo.example.com and then trying to locate it on the network. At least with a truncated hostname -- and DHCP renewals - -- we don't get a hostname collision and a network admin can see that truncated hostname A really host A.

I'm not sure I understand your objection. If we bootstrap and get a certificate we can only update the node that matches that certificate. We don't configure a nodeName override in the kubelet config generated based on the rendered machine-config. If the machine boots with another hostname or kubelet restarts and the hostname has changed the machine can no longer update the node entry. Honestly it seems like we shouldn't mess with the hostname and instead define a stable nodeName with one of the following methods, choose your order or precedence:

nodeName recorded during bootstrap (not currently done but could be added)
The subject name on our certificate if it exists
Afterburn reported name from metadata (without truncation)
What the kernel reports as the hostname (and stop circumventing NM's DHCP process, just use what the kernel provides)
DNS PTR resolution

I don't think that nodeName has the same limitation as the kernel's limit on the hostname so we don't have to do any truncation.

The way it is done now means that the hostname is handled differently for OpenShift than other hosts on the same platform creating another edge case for ops to deal with.

Users have not been able to build 4.6 GCP clusters for 2 weeks now while we discuss all of this and I still don't see any clear decision about how to revise my PR so it is accepted. It seems like someone needs to make a call about how it should be done. I'm happy to offer another PR but not without blessing of the method because it looks like yet again I'm contributing my time for free without my code getting accepted. (I made several PRs to openshift-ansible-contrib but they rewritten by team members and as a result I lost out any attribution.)

cgwalters · 2020-10-28T13:55:46Z

(I made several PRs to openshift-ansible-contrib but they rewritten by team members and as a result I lost out any attribution.)

Hmm indeed that's not cool - Github pioneered Co-authored-by for this reason and I'd expect OpenShift (and other) developers to use it.

cgwalters · 2020-10-28T17:15:50Z

Yeah, you're certainly right in that it wouldn't work to try to change the hostname after the node is part of the cluster.

That said we had a prior discussion about this in this PR: coreos/ignition-dracut#156

The basic argument against is that the hostname must be resolvable by other nodes - so "source of truth" for the naming hence must canonically be external.

bdurrow · 2020-10-28T17:18:31Z

I am working on a revised PR that will include a function to test that the hostname is currently valid (with at least two elements with a dot between them) and use that for the conditional in question. If anyone sees a problem with that strategy then please let me know sooner rather than later.

darkmuggle · 2020-10-28T18:07:33Z

@bdurrow Thank you for your efforts here.

I am working on a revised PR that will include a function to test that the hostname is currently valid (with at least two elements with a dot between them) and use that for the conditional in question. If anyone sees a problem with that strategy then please let me know sooner rather than later.

However, I asked the FCOS team to consider enabling Afterburn for GCP and leaving this template. Afterburn can set the hostname from meta-data and on GCP that is a source of truth.

@lucab apparently has discussed this with the systemd team, and their advice is to avoid any magic strings like localhost or fedora.

The better fix, IMO, is to ensure that FCOS gets a valid hostname. I would like to see this code path removed entirely, Having a sensible check seems better -- but getting that right may prove buggy when checking DNS hosts since the cluster DNS may not know.

I understand that OKD is broken, but fixing FCOS is going to be the better path.

bdurrow · 2020-10-28T18:09:49Z

There was a bug in the original code that would match any hostname with a localhost substring, i.e. valid-node.localhost.com. This code tries to address that bug and also provide logic that will work under almost all cases. The one exception that I know of is if there is a repeated dash in an element. I don't believe that a repeated dash (-) is allowed in most circumstances but this doesn't check for that.

My test harness:

#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

function validDomain() {
  local kn=$1
  local DOMAIN_ELEMENT_REGEX='([a-zA-Z0-9]{1,2}|[a-zA-Z0-9][a-zA-Z0-9\-]{1,61}[a-zA-Z0-9])'
  shopt -s extglob
  if [[ ! "${kn}" == "localhost.localdomain" ]] &&
     [[ "${kn}" =~ ^(${DOMAIN_ELEMENT_REGEX}\.)+${DOMAIN_ELEMENT_REGEX}.?$ ]] && 
     [[ "${#kn}" -le 63 ]]; then
    return 0
  else
    return 1
  fi
}

function domainAssert() {
  local domainName=$1
  local assertion=$2
  if [[ ${assertion} -eq 0 ]] && validDomain $domainName; then
    echo "PASS: $domainName is valid"
  elif [[ ${assertion} -eq 1 ]] && ! validDomain $domainName; then
    echo "PASS: $domainName is INVALID"
  else
    echo "FAIL: $domainName is $(validDomain ${domainName} && echo "valid" || echo "INVALID")"
  fi
}



domainAssert fedora 1
domainAssert debian 1
domainAssert localhost 1
domainAssert localhost.localdomain 1
domainAssert notrealylocalhost 1
domainAssert localhostnot 1
domainAssert localhost.not 0
domainAssert double--dash.com 1 #This should fail but not sure how to check for this with bash
domainAssert under_score.com 1
domainAssert "-foo.bar.com" 1
domainAssert "foo-.bar.com" 1
domainAssert xn--stackoverflow.com 0
domainAssert stackoverflow.xn--com 0
domainAssert stackoverflow.co.uk 0

My results:

PASS: fedora is INVALID
PASS: debian is INVALID
PASS: localhost is INVALID
PASS: localhost.localdomain is INVALID
PASS: notrealylocalhost is INVALID
PASS: localhostnot is INVALID
PASS: localhost.not is valid
FAIL: double--dash.com is valid
PASS: under_score.com is INVALID
PASS: -foo.bar.com is INVALID
PASS: foo-.bar.com is INVALID
PASS: xn--stackoverflow.com is valid
PASS: stackoverflow.xn--com is valid
PASS: stackoverflow.co.uk is valid

kikisdeliveryservice · 2020-10-28T18:24:55Z

I understand that OKD is broken, but fixing FCOS is going to be the better path.

This sounds like a prudent option, do we have a tracking FCOS issue or BZ?

cgwalters · 2020-10-28T18:27:09Z

This sounds like a prudent option, do we have a tracking FCOS issue or BZ?

https://bugzilla.redhat.com/show_bug.cgi?id=1892235

bdurrow · 2020-10-28T18:28:07Z

Sorry for all the noise, it seems that I had a fundamental misunderstanding of how /etc/hostname is supposed to be used. Back in RHEL 5 the deployment guide said to put the fqdn (minus the final .) in /etc/hostname. Given that the kernel limits this to 63/64 characters and other distributions including more modern RHEL deployment guides it appears that this should always be just the hostname element, not a FQDN. To add to this challenge it appears that GCP sends a FQDN in the DHCP hostname field which NM will propagate to the hostname. This is not simply an issue with fedora but GCP will break the convention of only using a the first element as the hostname for all distributions.

bdurrow · 2020-10-28T18:40:20Z

I'm closing this pull request because I it seems that there is another plan to resolve this issue that I am in a poor position to help with. If the chosen path is to have afterburn set the hostname in GCP with corrections for length MCO changes will have to be coordinated in with that change to the FCOS image.

I would like to raise one last point. I think that afterburn should not conditionally apply length logic to how it munges the hostname. Doing so will quite likely lead to masters using one code path and workers another in at least some cases.

darkmuggle · 2020-10-29T02:47:32Z

I'm closing this pull request because I it seems that there is another plan to resolve this issue that I am in a poor position to help with.

@bdurrow there are plenty of constructive ways to help out with the FCOS community to help drive a fix forward. Again, my pushback was to push the resolution towards the better fix, in a better place, which should result in a more robust user experience. I would challenge you to engage with the FCOS community to find out how this issue can be resolved, and how you can help.

bdurrow · 2020-10-29T02:52:19Z

Thanks @darkmuggle, I am not very interested in FCOS but I do hope to contribute to Openshift 4. I'm most interested in pushing this issue to PR but I can't figure out where to start: openshift/machine-api-operator#737. Suggestions? Resources?

LorbusChris · 2020-11-11T16:43:22Z

The proper solution to this will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572
but I don't think we want to block the OKD 4.6 release til that lands.

This is a temporary workaround that we should get in now.
@bdurrow would you mind updating the commit message to reference the NM issue, and make clear this is temporary?
/reopen

openshift-ci-robot · 2020-11-11T16:43:39Z

@LorbusChris: Reopened this PR.

Details

In response to this:

The proper solution to this will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572
but I don't think we want to block the OKD 4.6 release til that lands.

This is a temporary workaround that we should get in now.
@bdurrow would you mind updating the commit message to reference the NM issue, and make clear this is temporary?
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

LorbusChris · 2020-11-11T16:44:30Z

@darkmuggle @runcom would be ok carrying this temporary fix until the proper solution lands?

LorbusChris · 2020-11-11T18:11:48Z

/cherry-pick release-4.6

openshift-cherrypick-robot · 2020-11-11T18:11:49Z

@LorbusChris: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

darkmuggle · 2020-11-11T19:18:12Z

/hold

openshift-merge-robot · 2020-11-11T21:14:39Z

@bdurrow: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-gcp-op	`1e02d9f`	link	`/test e2e-gcp-op`
ci/prow/e2e-aws	`1e02d9f`	link	`/test e2e-aws`
ci/prow/e2e-aws-serial	`1e02d9f`	link	`/test e2e-aws-serial`
ci/prow/e2e-ovn-step-registry	`1e02d9f`	link	`/test e2e-ovn-step-registry`
ci/prow/okd-e2e-aws	`1e02d9f`	link	`/test okd-e2e-aws`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

darkmuggle · 2020-11-11T22:44:21Z

The FCOS agreed-to path is currently inflight with a patch submitted.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677

LorbusChris · 2020-11-12T18:44:08Z

This is now superseded by the stopgap solution from #2217
/close

openshift-ci-robot · 2020-11-12T18:44:25Z

@LorbusChris: Closed this PR.

Details

In response to this:

This is now superseded by the stopgap solution from #2217
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bdurrow · 2020-11-14T20:53:47Z

@LorbusChris, it seems unlikely that this PR is the path forward but if the people who opposed it before change their mind I am happy to make the changes you asked for.

Update hostname when transient hostname is fedora

1e02d9f

A recent change, perhaps the adoption of fcos 33 broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the dhcp provided hostname in this case.

openshift-ci-robot requested review from cgwalters and runcom October 14, 2020 13:53

openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 14, 2020

This was referenced Oct 14, 2020

Update hostname when transient hostname is fedora #2158

Closed

GCP Builds Fail since 4.6.0-0.okd-2020-10-08-055226 okd-project/okd#352

Closed

openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 14, 2020

vrutkovs mentioned this pull request Oct 15, 2020

[Tracking] OKD 4.6 Stable okd-project/okd#348

Closed

6 tasks

vrutkovs mentioned this pull request Oct 20, 2020

default hostname now is fedora, used to be localhost coreos/fedora-coreos-tracker#649

Closed

vrutkovs mentioned this pull request Oct 22, 2020

Bug 1887607: OVS config: check if OVS is installed #2164

Merged

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2020

darkmuggle suggested changes Oct 27, 2020

View reviewed changes

bdurrow closed this Oct 28, 2020

openshift-ci-robot reopened this Nov 11, 2020

openshift-ci-robot closed this Nov 12, 2020

Update hostname when transient hostname is fedora #2160

Update hostname when transient hostname is fedora #2160

Uh oh!

Conversation

bdurrow commented Oct 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Oct 14, 2020

Uh oh!

vrutkovs commented Oct 14, 2020

Uh oh!

bdurrow commented Oct 14, 2020

Uh oh!

openshift-ci-robot commented Oct 14, 2020

Uh oh!

vrutkovs commented Oct 14, 2020

Uh oh!

vrutkovs commented Oct 15, 2020

Uh oh!

dustymabe commented Oct 16, 2020

Uh oh!

vrutkovs commented Oct 19, 2020

Uh oh!

vrutkovs commented Oct 20, 2020

Uh oh!

bdurrow commented Oct 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

darkmuggle commented Oct 22, 2020

Uh oh!

vrutkovs commented Oct 22, 2020

Uh oh!

darkmuggle commented Oct 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vrutkovs commented Oct 22, 2020

Uh oh!

bdurrow commented Oct 22, 2020

Uh oh!

cgwalters commented Oct 22, 2020

Uh oh!

yuqi-zhang commented Oct 22, 2020

Uh oh!

cgwalters commented Oct 22, 2020

Uh oh!

bdurrow commented Oct 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

darkmuggle left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

openshift-ci-robot commented Oct 27, 2020

Uh oh!

darkmuggle commented Oct 27, 2020

Uh oh!

bdurrow commented Oct 28, 2020

Uh oh!

cgwalters commented Oct 28, 2020

Uh oh!

cgwalters commented Oct 28, 2020

Uh oh!

bdurrow commented Oct 28, 2020

Uh oh!

darkmuggle commented Oct 28, 2020

Uh oh!

bdurrow commented Oct 28, 2020

Uh oh!

kikisdeliveryservice commented Oct 28, 2020

Uh oh!

cgwalters commented Oct 28, 2020

Uh oh!

bdurrow commented Oct 28, 2020

Uh oh!

bdurrow commented Oct 28, 2020

Uh oh!

darkmuggle commented Oct 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdurrow commented Oct 29, 2020

bdurrow commented Oct 14, 2020 •

edited

Loading

bdurrow commented Oct 20, 2020 •

edited

Loading

darkmuggle commented Oct 22, 2020 •

edited

Loading

bdurrow commented Oct 26, 2020 •

edited

Loading

darkmuggle left a comment •

edited

Loading

darkmuggle commented Oct 29, 2020 •

edited

Loading