Update hostname when transient hostname is fedora #2160
A recent change, perhaps the adoption of FCOS 33, broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the DHCP-provided hostname in this case.
|
Hi @bdurrow. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test This indeed fixes OKD 4.6 install with F33 content on GCP (thanks!), however I'd prefer to find out why default hostname is now |
|
I agree this is not the best fix but is one that provides a workaround that I know works and I can apply as a MachineConfig patch. I think we all agree that |
|
@bdurrow: The following tests failed, say
Full PR test history. Your PR dashboard. |
|
/retest |
|
Flakes /retest |
I don't know what could be causing that. Does it only happen on GCP? Can you reproduce with just FCOS by itself (no OKD)? If you can maybe open an issue over at https://github.com/coreos/fedora-coreos-tracker. |
|
/retest |
|
Right, this happens on plain FCOS as well. Default hostname on AWS is |
|
In general, I think that this code-path is a hack already (and yeah, I started it). Dealing with |
|
I'd certainly prefer keeping |
I don't think that "effortlessly" is the correct metric. Fixing this issue in the best place is the better solution. For FCOS, ensuring that the hostname is set via Afterburn is an option that could be explored. And while FCOS having "fedora" as the hostname is not a problem per se, if this is happening for all GCP nodes with long names, I would submit we should fix FCOS. Note: the reason why RHCOS can't be fixed the same way as FCOS is due to RHCOS not running Afterburn on GCP. And since the RHCOS boot images are not updated (i.e. Afterburn/first boot would have already run) the MCO was the only reasonable place to address this bug. For FCOS, the story is very different. |
|
The issue happens on any GCP node; it affects F33-based machines due to the introduction of systemd-resolved. Not sure if Afterburn is supposed to fix the hostname after -resolved (but I don't mind really). Let's keep this on and continue in coreos/fedora-coreos-tracker#649 |
This happens for all GCP FCOS nodes, not just ones with long names. |
|
Probably the simplest thing is to check not for |
|
Thanks for the report! And in this case I agree with @darkmuggle , I think the fix should get into FCOS and not the MCO. I'm not very comfortable expanding the hack to include more specific strings like this. |
|
The canonical Fedora change looks like https://src.fedoraproject.org/rpms/systemd/c/6eb8bcde288dda39b163e87ee0926f6f30fcad73 |
|
Thanks @lucab, I understand the afterburn issue as you describe it. Hopefully the logic that we should set the static hostname to the node name to provide stability is sound. If it is, one possible upgrade solution would be to commit that change to the 4.5 branch and write the upgrade graph such that an upgrade to that z-stream version is required before upgrading to a version that changes the hostname logic. |
Formalizing my objections in a review.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bdurrow. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing |
|
To be clear, I'm only arguing for Afterburn to apply the GCP hostname -- that's the platform that's giving us fits.
@bdurrow that's a heavy hammer and one that I strongly oppose -- truncating the hostname is bad enough. Binding a dynamically assigned hostname could result in multiple nodes having the same truncated name much less the debug pain of seeing |
I'm not sure I understand your objection. If we bootstrap and get a certificate, we can only update the node that matches that certificate. We don't configure a nodeName override in the kubelet config generated based on the rendered machine-config. If the machine boots with another hostname, or kubelet restarts and the hostname has changed, the machine can no longer update the node entry. Honestly, it seems like we shouldn't mess with the hostname and should instead define a stable nodeName with one of the following methods, choose your order of precedence:
I don't think that nodeName has the same limitation as the kernel's limit on the hostname, so we don't have to do any truncation. The way it is done now means that the hostname is handled differently for OpenShift than for other hosts on the same platform, creating another edge case for ops to deal with. Users have not been able to build 4.6 GCP clusters for 2 weeks now while we discuss all of this, and I still don't see any clear decision about how to revise my PR so it is accepted. It seems like someone needs to make a call about how it should be done. I'm happy to offer another PR but not without blessing of the method, because it looks like yet again I'm contributing my time for free without my code getting accepted. (I made several PRs to openshift-ansible-contrib but they were rewritten by team members and as a result I lost out on any attribution.) |
Hmm indeed that's not cool - GitHub pioneered Co-authored-by for this reason and I'd expect OpenShift (and other) developers to use it. |
|
Yeah, you're certainly right in that it wouldn't work to try to change the hostname after the node is part of the cluster. That said, we had a prior discussion about this in this PR: coreos/ignition-dracut#156. The basic argument against it is that the hostname must be resolvable by other nodes, so the "source of truth" for naming must canonically be external. |
|
I am working on a revised PR that will include a function to test that the hostname is currently valid (with at least two elements with a dot between them) and use that for the conditional in question. If anyone sees a problem with that strategy then please let me know sooner rather than later. |
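For illustration, the check described above (at least two elements with a dot between them) might look roughly like the following Python sketch. The function name `has_domain_part` is hypothetical, and the dispatcher script itself is shell, so this mirrors the idea only and is not the revised PR's code.

```python
def has_domain_part(hostname: str) -> bool:
    """Return True if the name has at least two non-empty, dot-separated labels.

    Hypothetical helper: a trailing dot (fully qualified form) is ignored.
    """
    labels = hostname.rstrip(".").split(".")
    # all(labels) rejects empty labels such as the one in "a..b"
    return len(labels) >= 2 and all(labels)


if __name__ == "__main__":
    for name in ("fedora", "localhost", "node-1.c.example-project.internal"):
        print(name, has_domain_part(name))
```

One caveat with a check of this shape: a name such as `localhost.localdomain` also has two labels and would be accepted, so it would probably need to be combined with some other test.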
|
@bdurrow Thank you for your efforts here.
However, I asked the FCOS team to consider enabling Afterburn for GCP and leaving this template. Afterburn can set the hostname from meta-data and on GCP that is a source of truth. @lucab apparently has discussed this with the systemd team, and their advice is to avoid any magic strings like localhost or fedora. The better fix, IMO, is to ensure that FCOS gets a valid hostname. I would like to see this code path removed entirely. Having a sensible check seems better -- but getting that right may prove buggy when checking DNS hosts since the cluster DNS may not know. I understand that OKD is broken, but fixing FCOS is going to be the better path. |
|
There was a bug in the original code that would match any hostname with a localhost substring, e.g. valid-node.localhost.com. This code tries to address that bug and also provide logic that will work in almost all cases. The one exception that I know of is a repeated dash in an element. I don't believe that a repeated dash (-) is allowed in most circumstances, but this doesn't check for that. My test harness: My results: |
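To make the substring problem concrete, here is a small Python sketch. It is not the test harness mentioned above (its contents are not shown here), and both function names are hypothetical. It contrasts a substring test with a comparison against the first label only, which is one possible way to avoid the false match.

```python
PLACEHOLDER = "localhost"


def substring_match(hostname: str) -> bool:
    """Over-broad: true whenever 'localhost' appears anywhere in the name."""
    return PLACEHOLDER in hostname


def first_label_match(hostname: str) -> bool:
    """Stricter: true only when the first dot-separated label is exactly 'localhost'."""
    return hostname.split(".", 1)[0] == PLACEHOLDER


if __name__ == "__main__":
    for name in ("localhost", "localhost.localdomain", "valid-node.localhost.com"):
        print(name, substring_match(name), first_label_match(name))
```

With the substring test, `valid-node.localhost.com` is treated as unset even though it is a usable name; comparing only the first label keeps `localhost` and `localhost.localdomain` as placeholders without catching that case.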
This sounds like a prudent option, do we have a tracking FCOS issue or BZ? |
|
|
Sorry for all the noise, it seems that I had a fundamental misunderstanding of how /etc/hostname is supposed to be used. Back in RHEL 5 the deployment guide said to put the FQDN (minus the final .) in /etc/hostname. Given that the kernel limits this to 63/64 characters, and given what other distributions and more modern RHEL deployment guides say, it appears that this should always be just the hostname element, not an FQDN. To add to this challenge, it appears that GCP sends an FQDN in the DHCP hostname field, which NM will propagate to the hostname. This is not simply an issue with Fedora: GCP will break the convention of using only the first element as the hostname for all distributions. |
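A rough Python sketch of the convention described above: keep only the first element of whatever DHCP supplies, and check it against a length limit. The helper names, the example FQDN, and the choice of 63 as the limit are all assumptions for illustration; nothing here is taken from the MCO template or from Afterburn.

```python
# Assumption: use 63 as a conservative limit (the comment above cites 63/64).
HOSTNAME_LIMIT = 63


def first_label(name: str) -> str:
    """Return the first dot-separated element of a (possibly fully qualified) name."""
    return name.rstrip(".").split(".", 1)[0]


def fits_limit(name: str, limit: int = HOSTNAME_LIMIT) -> bool:
    """Return True if the name is non-empty and no longer than the limit in bytes."""
    return 0 < len(name.encode("utf-8")) <= limit


if __name__ == "__main__":
    # Illustrative FQDN of the kind a DHCP server might hand out.
    fqdn = "okd-worker-0.c.example-project.internal"
    short = first_label(fqdn)
    print(short, fits_limit(short), fits_limit(fqdn))
```

Checking the length rather than truncating keeps the sketch neutral on the truncation concerns raised earlier in the thread.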
|
I'm closing this pull request because it seems that there is another plan to resolve this issue that I am in a poor position to help with. If the chosen path is to have afterburn set the hostname in GCP with corrections for length, MCO changes will have to be coordinated with that change to the FCOS image. I would like to raise one last point. I think that afterburn should not conditionally apply length logic to how it munges the hostname. Doing so will quite likely lead to masters using one code path and workers another in at least some cases. |
@bdurrow there are plenty of constructive ways to help out with the FCOS community to help drive a fix forward. Again, my pushback was to push the resolution towards the better fix, in a better place, which should result in a more robust user experience. I would challenge you to engage with the FCOS community to find out how this issue can be resolved, and how you can help. |
|
Thanks @darkmuggle, I am not very interested in FCOS but I do hope to contribute to OpenShift 4. I'm most interested in pushing this issue to PR but I can't figure out where to start: openshift/machine-api-operator#737. Suggestions? Resources? |
|
The proper solution to this will be https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/572 This is a temporary workaround that we should get in now. |
|
@LorbusChris: Reopened this PR. |
|
@darkmuggle @runcom would you be ok carrying this temporary fix until the proper solution lands? |
|
/cherry-pick release-4.6 |
|
@LorbusChris: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you. |
|
/hold |
|
|
The FCOS agreed-to path is currently in flight, with a patch submitted. |
|
This is now superseded by the stopgap solution from #2217 |
|
@LorbusChris: Closed this PR. |
|
@LorbusChris, it seems unlikely that this PR is the path forward but if the people who opposed it before change their mind I am happy to make the changes you asked for. |
Rebase of #2158. A recent change, perhaps the adoption of FCOS 33, broke hostname assignment (at least in GCP) because the transient hostname was set to fedora on boot. We need to use the DHCP-provided hostname in this case.
- What I did
Modified templates/common/_base/files/etc-networkmanager-dispatcher.d-90-long-hostname.yaml so that the string fedora is treated like localhost when we check to see if the hostname was previously set.
- How to verify it
Build a 4.6 cluster in GCP.
- Description for the changelog
Use dhcp provided hostname when transient hostname is fedora
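The change described under "What I did" can be sketched as follows. This is a minimal Python illustration of the decision only; the real template is a NetworkManager dispatcher shell script, and the names `PLACEHOLDER_HOSTNAMES` and `pick_hostname` are hypothetical.

```python
# Assumption: these two strings are the only placeholders considered "not yet set".
PLACEHOLDER_HOSTNAMES = {"localhost", "fedora"}


def pick_hostname(transient: str, dhcp_hostname: str) -> str:
    """Prefer the DHCP-provided name when the transient hostname is a placeholder.

    If the transient hostname looks like it was already set, or DHCP supplied
    nothing, the transient hostname is kept unchanged.
    """
    if transient in PLACEHOLDER_HOSTNAMES and dhcp_hostname:
        return dhcp_hostname
    return transient


if __name__ == "__main__":
    print(pick_hostname("fedora", "node-1.c.example-project.internal"))
    print(pick_hostname("node-1", "something-else"))
```

As the review comments point out, a list of magic strings like this is fragile: any new distribution default would need another entry, which is part of why the fix was pushed towards FCOS instead.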