validate node hostnames when they are being created or updated #46892

Closed
capnspacehook wants to merge 7 commits into master from capnspacehook/validate-node-hostnames

Conversation

@capnspacehook
Contributor

@capnspacehook capnspacehook commented Sep 24, 2024

See #46892 (comment) for why the validation logic was changed from using utils.IsValidHostname.

Updates https://github.com/gravitational/teleport-private/issues/1676.

changelog: validate node hostnames when they are being created or updated

@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from b6945ce to 886e7ec Compare September 30, 2024 20:21
@capnspacehook
Contributor Author

@marcoandredinis do you think this could interfere/break EC2 discovery? I don't think so as the DNS names that are assigned to EC2 nodes always seem to be valid hostnames. It looks like when creating an EC2 instance with RunInstances the hostname can be set to be built from the IPv4 address or the instance ID, both of which will create valid hostnames as far as I can tell.

@marcoandredinis
Contributor

> @marcoandredinis do you think this could interfere/break EC2 discovery? I don't think so as the DNS names that are assigned to EC2 nodes always seem to be valid hostnames. It looks like when creating an EC2 instance with RunInstances the hostname can be set to be built from the IPv4 address or the instance ID, both of which will create valid hostnames as far as I can tell.

I don't think this will break EC2 discovery.
We were also using the TeleportHostname tag's value in case it existed, and validating that value is useful as well.

Comment thread lib/services/local/presence.go Outdated
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 62857a7 to 6808937 Compare October 10, 2024 16:20
Comment thread lib/inventory/controller.go Outdated
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from cfb23de to 31350e2 Compare October 19, 2024 00:35
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 31350e2 to 1475874 Compare October 30, 2024 18:03
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 8394fe5 to 897a528 Compare October 30, 2024 19:27
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 897a528 to 04aead8 Compare October 30, 2024 20:13
@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 04aead8 to b41998e Compare October 30, 2024 20:20
@capnspacehook
Contributor Author

capnspacehook commented Oct 30, 2024

It turns out that setting node hostnames to UUIDs is pretty common in our tests and in the codebase in general, so I added a function that specifically validates node hostnames. utils.IsValidHostname checks whether a hostname is a valid domain name as defined by RFC 1035, which is a bit too strict for what we need.
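For readers following along, here is a minimal sketch of what a relaxed node hostname check might look like. It is not the PR's utils.ValidateNodeHostname: the function name, the allowed character set (ASCII letters, digits, dashes, and dots), and the 256-byte limit are all assumptions made for illustration, since the actual rules are not shown in this thread. The point is only that such a check can accept UUID-style names while still rejecting spaces and other unexpected characters.

```go
package main

import (
	"errors"
	"fmt"
)

// maxHostnameLen is an assumed limit for this sketch; the limit enforced
// by the PR is not stated in the conversation.
const maxHostnameLen = 256

// validateNodeHostnameSketch is a hypothetical, relaxed hostname check.
// It accepts any non-empty name made of ASCII letters, digits, dashes,
// and dots, so UUIDs pass even when they begin with a digit.
func validateNodeHostnameSketch(hostname string) error {
	if hostname == "" {
		return errors.New("hostname is empty")
	}
	if len(hostname) > maxHostnameLen {
		return fmt.Errorf("hostname is %d bytes long, limit is %d", len(hostname), maxHostnameLen)
	}
	for _, r := range hostname {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
			// ASCII letters and digits are allowed.
		case r == '-' || r == '.':
			// Dashes and dots are allowed anywhere in this sketch.
		default:
			return fmt.Errorf("hostname contains disallowed character %q", r)
		}
	}
	return nil
}
```

Under these assumptions a UUID such as 1f0e3c2a-9b7d-4c1e-8a55-0d2f6b7c9e10 is accepted, while a name like "foo bar" is rejected because of the space.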

Contributor

@rosstimothy rosstimothy left a comment

What impact will this have on any existing nodes with a now invalid hostname? Will they fail to heartbeat and disappear from the inventory altogether? How will cluster admins be able to identify these invalid hosts? What recourse will they have to resolve the problem if Teleport is their only means to access the now offline hosts?

@capnspacehook
Contributor Author

capnspacehook commented Oct 30, 2024

The checks were added such that nodes with hostnames that were previously allowed shouldn't be affected at all; the checks are only done at ~~the heartbeat input point~~ the EICE discovery point and the manual creation input point.

@rosstimothy I tested this before, but on testing again I found that a check at the heartbeat input point triggered errors for existing nodes.

@rosstimothy
Contributor

I'm not convinced the current state of this PR addresses all the possible ways the linked issue could be exploited. If you want to merge as is, that's fine, but I don't think we should close the linked issue.

@fspmarshall
Contributor

+1 to current state not fully resolving the issue since it still permits malformed hostnames to be added by either token registration or changing the hostname of an extant agent.

I don't see us solving those cases in a simple/performant manner. A reasonable compromise might be to backport the current state (no new openssh/discovery nodes), along with a notification that goes something like:

node "<hostname>" is configured with a malformed hostname. future versions of teleport will evict nodes with malformed hostnames. please update it to use a hostname consisting of <allowed-characters>.

Then, merge hard rejection of all invalid hostnames for master/v18+. The linked issue is (IMO) fairly low severity, so I don't think giving folks a major version cycle to migrate their existing agents would be unreasonable.

@capnspacehook
Contributor Author

Got approval from @r0mant to issue a warning in v17 and hard reject invalid hostnames in v18. Added a warning here; after this is merged I will open a PR to add it to the v17 changelog, and another PR to master that will hard reject invalid hostnames.

Contributor

@fspmarshall fspmarshall left a comment

I think it would be nice to have a periodic that checks for invalid hostnames and generates a notification to improve visibility in affected clusters. That could easily be part of followup work though.

Comment thread lib/services/local/presence.go
Contributor

@rosstimothy rosstimothy left a comment

It would be nice to have test coverage outside of the utils package to verify invalid hostnames are forbidden going forward and prevent any regressions.

@fspmarshall
Contributor

Sorry for throwing yet another wrench into this process, but @rosstimothy and I were chatting about this elsewhere and had some more thoughts on how we might make this a bit of a "safer" transition:

The whole evicting/rejecting nodes thing is scary. Even with notifications/warnings and a full major version, someone who is on an older version and upgrading fast might be blindsided by nodes dropping out unexpectedly. What if we don't evict/reject invalid hostnames at all, and instead come up with a scheme for keeping the agents dialable while also refusing to advertise invalid hostnames?

Just ignoring/omitting the invalid hostname is sketchy, as it leaves blank entries in the UI/tsh ls, and escaping is a bit of a non-starter because it might inadvertently introduce unexpected collisions. But, we could replace invalid hostnames with a placeholder value so that the nodes continue to show up. So long as the placeholder value has a sufficient random element, the nodes can still be dialed by hostname without any collisions or other unexpected errors. E.g. maybe we replace an invalid hostname with something like invalid-hostname-<random-suffix>. We could even write the original hostname to a label (e.g. "teleport.internal/invalid-hostname": "foo bar") for forensic purposes.

With the above (or something like it), we should be able to fully address invalid hostnames with no loss of access or information.

@capnspacehook what do you think? I know we've had a lot of back and forth about this issue, and I'm very sorry about that, but I think this is likely a much better path forward than anything we've considered thus far.
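The placeholder scheme proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was eventually written: the node struct is a hypothetical stand-in for Teleport's server resource, sanitizeHostname is an invented name, and isValid stands in for whichever validator is used. Only the invalid-hostname- prefix and the teleport.internal/invalid-hostname label key come from the comment itself.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// invalidHostnameLabel is the label key suggested in the thread for
// preserving the original hostname.
const invalidHostnameLabel = "teleport.internal/invalid-hostname"

// node is a hypothetical, simplified stand-in for a Teleport server resource.
type node struct {
	Hostname string
	Labels   map[string]string
}

// sanitizeHostname leaves valid hostnames untouched. For an invalid one it
// stores the original value in a label and replaces the hostname with
// "invalid-hostname-<random-suffix>" so the node stays listed and dialable.
func sanitizeHostname(n *node, isValid func(string) bool) error {
	if isValid(n.Hostname) {
		return nil
	}
	// 8 random bytes give a 16-character hex suffix; the length is an
	// assumption chosen to make collisions unlikely.
	suffix := make([]byte, 8)
	if _, err := rand.Read(suffix); err != nil {
		return fmt.Errorf("generating random suffix: %w", err)
	}
	if n.Labels == nil {
		n.Labels = make(map[string]string)
	}
	n.Labels[invalidHostnameLabel] = n.Hostname
	n.Hostname = "invalid-hostname-" + hex.EncodeToString(suffix)
	return nil
}
```

Note that this sketch draws a fresh suffix on every call, so a caller that wanted the placeholder to stay stable across heartbeats would need to persist it or derive it from something fixed.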

@capnspacehook capnspacehook force-pushed the capnspacehook/validate-node-hostnames branch from 5748aee to d5fb2fe Compare November 1, 2024 19:11
@capnspacehook
Contributor Author

Thanks for the idea @fspmarshall, that's a much better approach than evicting nodes. I added a periodic global notification sent to users that can update nodes, let me know if I need to change anything.

@capnspacehook
Contributor Author

@rosstimothy added regression test

Comment on lines +952 to +953
err := utils.ValidateNodeHostname(eiceNode.GetHostname())
if err != nil {
Contributor

Suggestion: reduce the error scope

Suggested change
- err := utils.ValidateNodeHostname(eiceNode.GetHostname())
- if err != nil {
+ if err := utils.ValidateNodeHostname(eiceNode.GetHostname()); err != nil {

return nil, trace.BadParameter("cannot place node in namespace %q, custom namespaces are deprecated", n)
}
if err := utils.ValidateNodeHostname(server.GetHostname()); err != nil {
s.log.Warnf(
Contributor

This will be misleading if this hostname violates the newly enforced length restriction. Perhaps the message here should be a bit more vague and include the error message instead?

Comment thread lib/auth/auth.go
Comment on lines +1475 to +1476
if err := utils.ValidateNodeHostname(srv.GetHostname()); err != nil {
invalidNodeHostnames = append(invalidNodeHostnames, srv.GetHostname())
Contributor

We should cap the number of invalid nodes added here, both to protect against consuming too much memory (what if all my nodes now have an invalid hostname 😨?) and to avoid overwhelming consumers of the notification. If the notification content is too large, we may be unable to persist the notification, and it would also be hard for a user to digest. We probably want some happy medium, since we also don't want one notification for each invalid hostname.
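One way to apply such a cap, sketched under assumptions: collect at most a fixed number of hostnames for display while still counting every invalid one, so the notification can say something like "these 10 and N more". The function name, the cap of 10, and the isValid callback are hypothetical and not taken from the PR.

```go
package main

// maxReportedHostnames is an assumed cap on how many invalid hostnames are
// kept for the notification body; the right value is a judgment call.
const maxReportedHostnames = 10

// collectInvalidHostnames returns at most maxReportedHostnames invalid
// hostnames plus the total number of invalid hostnames seen, so memory use
// and notification size stay bounded regardless of cluster size.
func collectInvalidHostnames(hostnames []string, isValid func(string) bool) (sample []string, total int) {
	for _, h := range hostnames {
		if isValid(h) {
			continue
		}
		total++
		if len(sample) < maxReportedHostnames {
			sample = append(sample, h)
		}
	}
	return sample, total
}
```

The notification text can then list the sample and report total minus len(sample) as the number of further affected nodes.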

}
if hostname := s.GetHostname(); hostname != "" {
if err := utils.ValidateNodeHostname(hostname); err != nil {
return nil, trace.Wrap(err)
Contributor

I lean toward never hard rejecting any invalid hostnames, and always permitting access, but obfuscating the hostname. Imagine that Teleport is the only means by which an admin has access to the host to change the hostname and we've now prevented them from being able to modify the hostname by rejecting here.

Contributor

@rosstimothy rosstimothy left a comment

As mentioned in a few comments, I don't think we should proceed with an approach where we reject heartbeats because of an invalid hostname. I've opened #48988 with the approach we discussed: moving the invalid hostname to a label and replacing the hostname with a valid alternative.

@rosstimothy
Contributor

Superseded by #48988.

@capnspacehook capnspacehook deleted the capnspacehook/validate-node-hostnames branch November 12, 2025 13:11