Skip to content

Conversation

@bdurrow
Copy link

@bdurrow bdurrow commented Oct 14, 2020

Rebase of #2158, A recent change, perhaps the adoption of fcos 33 broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the dhcp provided hostname in this case.

- What I did
Modified templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml so that the string fedora is treated like localhost when we check to see if the hostname was previously set.

- How to verify it
Build a 4.6 cluster in GCP.

- Description for the changelog

Use dhcp provided hostname when transient hostname is fedora

A recent change, perhaps the adoption of fcos 33 broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot.  We need to use the dhcp provided hostname in this case.
@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 14, 2020
@openshift-ci-robot
Copy link
Contributor

Hi @bdurrow. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vrutkovs
Copy link
Contributor

/ok-to-test

This indeed fixes OKD 4.6 install with F33 content on GCP (thanks!), however I'd prefer to find out why default hostname is now fedora instead of localhost. @dustymabe do you know of any change which could have caused this?

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 14, 2020
@bdurrow
Copy link
Author

bdurrow commented Oct 14, 2020

I agree this is not the best fix but is one that provides a workaround that I know works and I can apply as a MachineConfig patch. I think we all agree that set-valid-hostname.sh is not the right solution to the problem is solves and is in itself a workaround. In my particular environment hostnames are 52 characters and I would rather use the default behavior of NetworkManager but it is harder to accomplish that than offer this workaround. That said, I agree it is preferable to change the behavior that caused the need for this patch.

@openshift-ci-robot
Copy link
Contributor

@bdurrow: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-metal-ipi 1e02d9f link /test e2e-metal-ipi
ci/prow/e2e-aws 1e02d9f link /test e2e-aws
ci/prow/e2e-upgrade 1e02d9f link /test e2e-upgrade
ci/prow/okd-e2e-aws 1e02d9f link /test okd-e2e-aws
ci/prow/e2e-ovn-step-registry 1e02d9f link /test e2e-ovn-step-registry

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vrutkovs
Copy link
Contributor

/retest
Flakes

@vrutkovs
Copy link
Contributor

Flakes

/retest

@dustymabe
Copy link
Member

This indeed fixes OKD 4.6 install with F33 content on GCP (thanks!), however I'd prefer to find out why default hostname is now fedora instead of localhost. @dustymabe do you know of any change which could have caused this?

I don't know what could be causing that. Does it only happen on GCP? Can you reproduce with just FCOS by itself (no OKD)? If you can maybe open an issue over at https://github.com/coreos/fedora-coreos-tracker.

@vrutkovs
Copy link
Contributor

/retest

@vrutkovs
Copy link
Contributor

Right, this happens on plain FCOS as well. Default hostname on AWS is localhost though

@bdurrow
Copy link
Author

bdurrow commented Oct 20, 2020

@darkmuggle
Copy link

In general, I think that this code-path is a hack already (and yeah, I started it). Dealing with localhost is one thing, but I am leary about adding special cases. fedora could be either a default hostname or one configured by the cloud. localhost is universally agreed to mean the location for the loopback address (i.e. 127.0.0.0/8, ::1, etc).

@vrutkovs
Copy link
Contributor

I'd certainly prefer keeping localhost, but can it be done effortlessly in FCOS?

@darkmuggle
Copy link

darkmuggle commented Oct 22, 2020

I'd certainly prefer keeping localhost, but can it be done effortlessly in FCOS?

I don't think that "effortlessly" is the correct metric. Fixing this issue in the best place is the better solution. For FCOS, ensuring that the hostname is set via Afterburn is an option that could be explored. And while FCOS having "fedora" as the hostname is not a problem per se, if this is happening for all GCP nodes with long names, I would submit we should fix FCOS.

Note: the reason why RHCOS can't be fixed the same way as FCOS is due to RHCOS not running Afterburn on GCP. And since the RHCOS boot images are not updated (i.e. Afterburn/first boot would have already run) the MCO was the only reasonable place to address this bug. For FCOS, the story is very different.

@vrutkovs
Copy link
Contributor

The issue happens on any GCP nodes, it affects F33-based machines due to systemd-resolved introduction. Not sure if Afterburn is supposed to fix the hostname after -resolved (but I don't mind really).

Lets keep this on
/hold

and continue in coreos/fedora-coreos-tracker#649

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2020
@bdurrow
Copy link
Author

bdurrow commented Oct 22, 2020

And while FCOS having "fedora" as the hostname is not a problem per se, if this is happening for all GCP nodes with long names, I would submit we should fix FCOS.

This happens for all gcp fcos nodes, not just ones ones with long names.

@cgwalters
Copy link
Member

Probably the simplest thing is to check not for localhost but whether the hostname resolves to 127.0.0.1/::1.

@yuqi-zhang
Copy link
Contributor

Thanks for the report! And in this case I agree with @darkmuggle , I think the fix should get into FCOS and not the MCO. I'm not very comfortable expanding the hack to include more specific strings like this.

@cgwalters
Copy link
Member

The canonical Fedora change looks like https://src.fedoraproject.org/rpms/systemd/c/6eb8bcde288dda39b163e87ee0926f6f30fcad73
(It's not FCOS specific)

@bdurrow
Copy link
Author

bdurrow commented Oct 26, 2020

Thanks @lucab, I understand the afterburn issue as you describe it. Hopefully the the logic that we should set the static hostname to the node name to provide stability is sound. If it is, one possible upgrade solution would be to commit that change to the 4.5 branch and writing the upgrade graph such that an upgrade is required to that z stream version before upgrading to a version that changes the hostname logic.

Copy link

@darkmuggle darkmuggle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formalizing my objections in a review.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bdurrow
To complete the pull request process, please assign yuqi-zhang after the PR has been reviewed.
You can assign the PR to them by writing /assign @yuqi-zhang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@darkmuggle
Copy link

To be clear, I'm only arguing for Afterburn to apply the GCP hostname -- that's the platform that's giving us fits.

If we want the hostname to be a stable valid cluster name I think the solution is to set the static hostname instead of the transient one.

@bdurrow that's a heavy hammer and one that I strongly oppose -- truncating the hostname is bad enough. Binding a dynamically assigned hostname could result in multiple nodes having the same truncated name much less the debug pain of seeing foo.example.com and then trying to locate it on the network. At least with a truncated hostname -- and DHCP renewals - -- we don't get a hostname collision and a network admin can see that truncated hostname A really host A.

@bdurrow
Copy link
Author

bdurrow commented Oct 28, 2020

The question assumes that NM will only run the dispatcher once. But when you have an OVS VIF flapping, then the hostname could change. And since we need the hostname to stay that same for the life of the cluster or at least the boot....we have a guard here. The purpose of this specific check/hook is to provide a valid CLUSTER name, not to ensure a valid Linux hostname. Put another way, you can have a valid Linux hostname that's invalid for the cluster.

@bdurrow that's a heavy hammer and one that I strongly oppose -- truncating the hostname is bad enough. Binding a dynamically assigned hostname could result in multiple nodes having the same truncated name much less the debug pain of seeing foo.example.com and then trying to locate it on the network. At least with a truncated hostname -- and DHCP renewals - -- we don't get a hostname collision and a network admin can see that truncated hostname A really host A.

I'm not sure I understand your objection. If we bootstrap and get a certificate we can only update the node that matches that certificate. We don't configure a nodeName override in the kubelet config generated based on the rendered machine-config. If the machine boots with another hostname or kubelet restarts and the hostname has changed the machine can no longer update the node entry. Honestly it seems like we shouldn't mess with the hostname and instead define a stable nodeName with one of the following methods, choose your order or precedence:

  • nodeName recorded during bootstrap (not currently done but could be added)
  • The subject name on our certificate if it exists
  • Afterburn reported name from metadata (without truncation)
  • What the kernel reports as the hostname (and stop circumventing NM's DHCP process, just use what the kernel provides)
  • DNS PTR resolution

I don't think that nodeName has the same limitation as the kernel's limit on the hostname so we don't have to do any truncation.

The way it is done now means that the hostname is handled differently for OpenShift than other hosts on the same platform creating another edge case for ops to deal with.

Users have not been able to build 4.6 GCP clusters for 2 weeks now while we discuss all of this and I still don't see any clear decision about how to revise my PR so it is accepted. It seems like someone needs to make a call about how it should be done. I'm happy to offer another PR but not without blessing of the method because it looks like yet again I'm contributing my time for free without my code getting accepted. (I made several PRs to openshift-ansible-contrib but they rewritten by team members and as a result I lost out any attribution.)

@cgwalters
Copy link
Member

(I made several PRs to openshift-ansible-contrib but they rewritten by team members and as a result I lost out any attribution.)

Hmm indeed that's not cool - Github pioneered Co-authored-by for this reason and I'd expect OpenShift (and other) developers to use it.

@cgwalters
Copy link
Member

Yeah, you're certainly right in that it wouldn't work to try to change the hostname after the node is part of the cluster.

That said we had a prior discussion about this in this PR: coreos/ignition-dracut#156

The basic argument against is that the hostname must be resolvable by other nodes - so "source of truth" for the naming hence must canonically be external.

@bdurrow
Copy link
Author

bdurrow commented Oct 28, 2020

I am working on a revised PR that will include a function to test that the hostname is currently valid (with at least two elements with a dot between them) and use that for the conditional in question. If anyone sees a problem with that strategy then please let me know sooner rather than later.

@darkmuggle
Copy link

@bdurrow Thank you for your efforts here.

I am working on a revised PR that will include a function to test that the hostname is currently valid (with at least two elements with a dot between them) and use that for the conditional in question. If anyone sees a problem with that strategy then please let me know sooner rather than later.

However, I asked the FCOS team to consider enabling Afterburn for GCP and leaving this template. Afterburn can set the hostname from meta-data and on GCP that is a source of truth.

@lucab apparently has discussed this with the systemd team, and their advice is to avoid any magic strings like localhost or fedora.

The better fix, IMO, is to ensure that FCOS gets a valid hostname. I would like to see this code path removed entirely, Having a sensible check seems better -- but getting that right may prove buggy when checking DNS hosts since the cluster DNS may not know.

I understand that OKD is broken, but fixing FCOS is going to be the better path.

@bdurrow
Copy link
Author

bdurrow commented Oct 28, 2020

There was a bug in the original code that would match any hostname with a localhost substring, i.e. valid-node.localhost.com. This code tries to address that bug and also provide logic that will work under almost all cases. The one exception that I know of is if there is a repeated dash in an element. I don't believe that a repeated dash (-) is allowed in most circumstances but this doesn't check for that.

My test harness:

#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

function validDomain() {
  local kn=$1
  local DOMAIN_ELEMENT_REGEX='([a-zA-Z0-9]{1,2}|[a-zA-Z0-9][a-zA-Z0-9\-]{1,61}[a-zA-Z0-9])'
  shopt -s extglob
  if [[ ! "${kn}" == "localhost.localdomain" ]] &&
     [[ "${kn}" =~ ^(${DOMAIN_ELEMENT_REGEX}\.)+${DOMAIN_ELEMENT_REGEX}.?$ ]] && 
     [[ "${#kn}" -le 63 ]]; then
    return 0
  else
    return 1
  fi
}

function domainAssert() {
  local domainName=$1
  local assertion=$2
  if [[ ${assertion} -eq 0 ]] && validDomain $domainName; then
    echo "PASS: $domainName is valid"
  elif [[ ${assertion} -eq 1 ]] && ! validDomain $domainName; then
    echo "PASS: $domainName is INVALID"
  else
    echo "FAIL: $domainName is $(validDomain ${domainName} && echo "valid" || echo "INVALID")"
  fi
}



domainAssert fedora 1
domainAssert debian 1
domainAssert localhost 1
domainAssert localhost.localdomain 1
domainAssert notrealylocalhost 1
domainAssert localhostnot 1
domainAssert localhost.not 0
domainAssert double--dash.com 1 #This should fail but not sure how to check for this with bash
domainAssert under_score.com 1
domainAssert "-foo.bar.com" 1
domainAssert "foo-.bar.com" 1
domainAssert xn--stackoverflow.com 0
domainAssert stackoverflow.xn--com 0
domainAssert stackoverflow.co.uk 0

My results:

PASS: fedora is INVALID
PASS: debian is INVALID
PASS: localhost is INVALID
PASS: localhost.localdomain is INVALID
PASS: notrealylocalhost is INVALID
PASS: localhostnot is INVALID
PASS: localhost.not is valid
FAIL: double--dash.com is valid
PASS: under_score.com is INVALID
PASS: -foo.bar.com is INVALID
PASS: foo-.bar.com is INVALID
PASS: xn--stackoverflow.com is valid
PASS: stackoverflow.xn--com is valid
PASS: stackoverflow.co.uk is valid

@kikisdeliveryservice
Copy link
Contributor

I understand that OKD is broken, but fixing FCOS is going to be the better path.

This sounds like a prudent option, do we have a tracking FCOS issue or BZ?

@cgwalters
Copy link
Member

This sounds like a prudent option, do we have a tracking FCOS issue or BZ?

https://bugzilla.redhat.com/show_bug.cgi?id=1892235

@bdurrow
Copy link
Author

bdurrow commented Oct 28, 2020

Sorry for all the noise, it seems that I had a fundamental misunderstanding of how /etc/hostname is supposed to be used. Back in RHEL 5 the deployment guide said to put the fqdn (minus the final .) in /etc/hostname. Given that the kernel limits this to 63/64 characters and other distributions including more modern RHEL deployment guides it appears that this should always be just the hostname element, not a FQDN. To add to this challenge it appears that GCP sends a FQDN in the DHCP hostname field which NM will propagate to the hostname. This is not simply an issue with fedora but GCP will break the convention of only using a the first element as the hostname for all distributions.

@bdurrow
Copy link
Author

bdurrow commented Oct 28, 2020

I'm closing this pull request because I it seems that there is another plan to resolve this issue that I am in a poor position to help with. If the chosen path is to have afterburn set the hostname in GCP with corrections for length MCO changes will have to be coordinated in with that change to the FCOS image.

I would like to raise one last point. I think that afterburn should not conditionally apply length logic to how it munges the hostname. Doing so will quite likely lead to masters using one code path and workers another in at least some cases.

@bdurrow bdurrow closed this Oct 28, 2020
@darkmuggle
Copy link

darkmuggle commented Oct 29, 2020

I'm closing this pull request because I it seems that there is another plan to resolve this issue that I am in a poor position to help with.

@bdurrow there are plenty of constructive ways to help out with the FCOS community to help drive a fix forward. Again, my pushback was to push the resolution towards the better fix, in a better place, which should result in a more robust user experience. I would challenge you to engage with the FCOS community to find out how this issue can be resolved, and how you can help.

@bdurrow
Copy link
Author

bdurrow commented Oct 29, 2020

Thanks @darkmuggle, I am not very interested in FCOS but I do hope to contribute to Openshift 4. I'm most interested in pushing this issue to PR but I can't figure out where to start: openshift/machine-api-operator#737. Suggestions? Resources?

@LorbusChris
Copy link
Contributor

The proper solution to this will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572
but I don't think we want to block the OKD 4.6 release til that lands.

This is a temporary workaround that we should get in now.
@bdurrow would you mind updating the commit message to reference the NM issue, and make clear this is temporary?
/reopen

@openshift-ci-robot
Copy link
Contributor

@LorbusChris: Reopened this PR.

Details

In response to this:

The proper solution to this will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572
but I don't think we want to block the OKD 4.6 release til that lands.

This is a temporary workaround that we should get in now.
@bdurrow would you mind updating the commit message to reference the NM issue, and make clear this is temporary?
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LorbusChris
Copy link
Contributor

@darkmuggle @runcom would be ok carrying this temporary fix until the proper solution lands?

@LorbusChris
Copy link
Contributor

/cherry-pick release-4.6

@openshift-cherrypick-robot

@LorbusChris: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

Details

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@darkmuggle
Copy link

/hold

@openshift-merge-robot
Copy link
Contributor

@bdurrow: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-gcp-op 1e02d9f link /test e2e-gcp-op
ci/prow/e2e-aws 1e02d9f link /test e2e-aws
ci/prow/e2e-aws-serial 1e02d9f link /test e2e-aws-serial
ci/prow/e2e-ovn-step-registry 1e02d9f link /test e2e-ovn-step-registry
ci/prow/okd-e2e-aws 1e02d9f link /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@darkmuggle
Copy link

The FCOS agreed-to path is currently inflight with a patch submitted.
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/677

@LorbusChris
Copy link
Contributor

This is now superseded by the stopgap solution from #2217
/close

@openshift-ci-robot
Copy link
Contributor

@LorbusChris: Closed this PR.

Details

In response to this:

This is now superseded by the stopgap solution from #2217
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bdurrow
Copy link
Author

bdurrow commented Nov 14, 2020

@LorbusChris, it seems unlikely that this PR is the path forward but if the people who opposed it before change their mind I am happy to make the changes you asked for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.