RFD 0167: Automatic updates change proposal #39217
bernardjkim wants to merge 7 commits into master
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
| Yum has a similar feature that can exclude packages from a system update. This can be done by specifying teleport-ent to be excluded in the /etc/yum.conf file. | ||
| Step 3: The Teleport installation process should no longer rely on the package manager to download teleport-ent packages. Instead, the Teleport proxy will now serve the latest compatible version of the teleport-ent package. The Teleport updater will then be responsible for downloading and installing the teleport-ent packages from the Teleport proxy. |
There was a problem hiding this comment.
What's the plan for getting the artifacts securely delivered to the proxy?
There was a problem hiding this comment.
I'm not sure we want to tunnel 500MiB blobs through the proxy to 10k agents in the same maintenance window. We would get poor availability and slow downloads, and pay for a lot of bandwidth.
I would rather have the artifacts stored into an S3 bucket/CDN/whatever and the proxy to serve the bucket address and desired version.
If we move away from the package manager, we must solve the trust issue, exactly like we had to do for Kubernetes. The artifacts must be signed. The public key can be baked into the updater, or retrieved dynamically, maybe from the proxy. Serving the key from the proxy offers fewer guarantees if someone can MITM with valid TLS, but it would allow us to rotate the signing key without contacting all customers to redeploy.
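To make the "signed artifacts with a baked-in public key" option concrete, here is a minimal Go sketch. It assumes Ed25519 signatures over the raw artifact bytes plus the SHA256 checksum the CDN already publishes; the signature scheme, the function name `verifyArtifact`, and the error values are all hypothetical and not something this RFD has decided.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// Hypothetical sentinel errors, for illustration only.
var (
	errBadKey           = errors.New("public key has wrong length")
	errChecksumMismatch = errors.New("checksum mismatch")
	errBadSignature     = errors.New("signature verification failed")
)

// verifyArtifact checks the artifact against the published SHA256 checksum
// and then against an Ed25519 signature made over the artifact bytes.
// pub stands in for the key that would be baked into the updater.
func verifyArtifact(pub, artifact []byte, wantSHA256Hex string, sig []byte) error {
	if len(pub) != ed25519.PublicKeySize {
		return errBadKey
	}
	sum := sha256.Sum256(artifact)
	if !strings.EqualFold(hex.EncodeToString(sum[:]), wantSHA256Hex) {
		return errChecksumMismatch
	}
	if !ed25519.Verify(ed25519.PublicKey(pub), artifact, sig) {
		return errBadSignature
	}
	return nil
}

func main() {
	// Demo only: generate a throwaway key pair in place of a real release key.
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	artifact := []byte("pretend this is a teleport tarball")
	sum := sha256.Sum256(artifact)
	checksum := hex.EncodeToString(sum[:])
	sig := ed25519.Sign(priv, artifact)

	fmt.Println("original:", verifyArtifact(pub, artifact, checksum, sig))
	tampered := append([]byte("x"), artifact...)
	fmt.Println("tampered:", verifyArtifact(pub, tampered, checksum, sig))
}
```

With a baked-in key, a compromised proxy or CDN cannot produce an artifact that passes the signature check; the trade-off mentioned above is that rotating the key then requires shipping a new updater.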
There was a problem hiding this comment.
I would rather have the artifacts stored into an S3 bucket/CDN/whatever and the proxy to serve the bucket address and desired version.
Being able to curl off the proxy would be nice, but it should redirect to a CDN.
There was a problem hiding this comment.
I'm not sure we want to send the CDN URL from the proxy. This trades a lot of security guarantees against a bit of usability (if we own the domain, we can host whatever we want on it).
- if the agent reads the CDN URL from the proxy redirect: someone can takeover the proxy and have the agents install arbitrary things, especially if the proxy also serves the signature keys
- if the agent reads the version from the proxy but has a built-in CDN URL, someone taking over the proxy cannot force agents to install arbitrary content. Only content hosted in the teleport CDN can be downloaded. So it would require taking over the CDN as well (else the only thing you can do is a downgrade attack)
I think a middle ground would be having the proxy return the artifact name and the updater set the domain. This way, we can move artifacts and change the CDN structure remotely, but agents are not installing something random. It would allow users to configure the updater to pull through a proxy/mirror/cache if they have company policies in place regarding downloads.
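A rough Go sketch of that middle ground follows: the proxy supplies only an artifact file name, and the updater joins it onto a base URL it already trusts (built in, or set by the operator to a mirror). The base URL, the example file name, and the function `artifactURL` are hypothetical placeholders, and the allowed-character rule is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// defaultBaseURL is a hypothetical built-in download origin. An operator
// could override it to point at a company mirror or cache.
const defaultBaseURL = "https://cdn.example.com/teleport/"

// artifactURL combines a trusted base URL with an artifact name received
// from the proxy. The name must be a plain file name, so the proxy cannot
// redirect the updater to another host or directory.
func artifactURL(base, name string) (string, error) {
	if name == "" || strings.HasPrefix(name, ".") {
		return "", errors.New("invalid artifact name")
	}
	for _, r := range name {
		ok := r == '.' || r == '-' || r == '_' ||
			(r >= '0' && r <= '9') || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
		if !ok {
			return "", fmt.Errorf("artifact name contains forbidden character %q", r)
		}
	}
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Scheme != "https" || u.Host == "" {
		return "", errors.New("base URL must be an absolute https URL")
	}
	u.Path = path.Join("/", u.Path, name)
	return u.String(), nil
}

func main() {
	// The file name below is illustrative only.
	fmt.Println(artifactURL(defaultBaseURL, "teleport-ent-v15.1.4-linux-amd64-bin.tar.gz"))
	// A name that tries to leave the base directory or host is rejected.
	fmt.Println(artifactURL(defaultBaseURL, "../../evil.tar.gz"))
	fmt.Println(artifactURL(defaultBaseURL, "https://evil.example/x.tar.gz"))
}
```

Under these assumptions, taking over the proxy only lets an attacker pick among files already hosted under the trusted base, which matches the downgrade-only scenario described above.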
There was a problem hiding this comment.
However I'd love the install script from curl https://goteleport.com/static/install.sh | bash -s 15.1.4 to be served by the proxy.
So the installation method would be curl https://mytenant.teleport.sh/something/install.sh | bash
There was a problem hiding this comment.
What SLO do we have for auto updates and/or cloud? What SLO do we need to have for the release service (currently don't have any)?
There was a problem hiding this comment.
if the agent reads the CDN URL from the proxy redirect: someone can takeover the proxy and have the agents install arbitrary things, especially if the proxy also serves the signature keys
So the installation method would be curl https://mytenant.teleport.sh/something/install.sh | bash
Difference here is just that it's less scary for initial install?
If I take over the proxy, I can serve arbitrary install and discover scripts.
There was a problem hiding this comment.
If I take over the proxy, I can serve arbitrary install and discover scripts.
Installation time and update time are not equivalent. If you take over a proxy, you can compromise new agents by changing the bash script served at install time. However, you cannot compromise existing agents, so it is not a 0-click escalation.
This is already the case for regular joins: the agent just trusts whatever auth cert is returned by the proxy during the initial join. The agent can be stolen if you have valid TLS certs. However, an already enrolled agent will check the cluster CAs and cannot be stolen. The automatic update mechanism should provide the same security guarantees.
Proving the bash script authenticity is hard because:
- we don't have established trust except via the OS trust store, which is kinda weak
- the dynamic part injected by the proxy (version, URL, automatic updates, oss/ent, architecture, package manager, ...) blocks us from hashing the bash script
I'm not sure we have to solve the "bootstrapping in a compromised environment" problem as Teleport doesn't even solve it for a regular agent join (CA pins are not used when joining via proxy).
| ### Deprecate the stable/cloud teleport-ent package | ||
| Currently, the teleport-ent-updater package requires the teleport-ent package as a dependency. This means that the user must install the latest version of the teleport-ent package which may or may not be compatible with their Teleport control plane, or they must first specify a compatible version of teleport-ent to install. This puts unnecessary burden on the user, and complicates the installation process. | ||
| Step 1: To remove this burden from the user and simplify the installation process, the Teleport updater will support an install command. The install command accepts the necessary configuration and then installs the latest compatible version of the teleport-ent package for the user. |
There was a problem hiding this comment.
We discussed removing the package entirely. Are you still working on this section?
| ## Current Limitations | ||
There was a problem hiding this comment.
I would like to add another current limitation: the automatic update process only covers the agents, not the integrations.
Since v15 we have Teleport Operator users running against Cloud instances, and we already had users self-hosting plugins (see slack link).
I suspect that as Teleport Cloud grows and operator adoption increases, similar issues will arise. I think the proposed solution should be designed to be extended in the future to cover the integrations, at least self-hosted plugins and the operator.
|  | ||
| ## Current Limitations | ||
There was a problem hiding this comment.
Another limitation, one that Cloud does not face directly, is that automatic updates are Cloud-only. Because AUs are not properly supported and adopted everywhere, the Teleport code has a special "if Cloud" execution path. This widens the gap between Cloud and self-hosted and increases the probability of bugs. This path is typically less tested and expensive to maintain.
Getting rid of the special case would prevent incidents like the discovery one from 2 weeks ago.
| Step 2: Teleport must provide an alternate method of securely downloading the latest compatible version of Teleport. The Teleport CDN already serves Teleport artifacts and their SHA256 checksums. To ensure authenticity of these Teleport downloads, Teleport will now need to sign these artifacts and provide the digital signatures. The public key required to verify the digital signature will be baked into the Teleport updater. | ||
| Step 3: The Teleport updater must control all Teleport updates. To ensure this, the Teleport updater will no longer rely on the system package manager to install/update Teleport. Users will no longer be able to manually update Teleport through the system package manager. Instead, the Teleport updater will now download the Teleport artifacts from the Teleport CDN along with the SHA256 checksum and the digital signature. The Teleport updater will verify the Teleport artifact and install Teleport into the /var/lib/teleport directory. |
There was a problem hiding this comment.
digital signature
I think @russjones had some thoughts about scope and key management for this part.
| Step 3: The Teleport updater must control all Teleport updates. To ensure this, the Teleport updater will no longer rely on the system package manager to install/update Teleport. Users will no longer be able to manually update Teleport through the system package manager. Instead, the Teleport updater will now download the Teleport artifacts from the Teleport CDN along with the SHA256 checksum and the digital signature. The Teleport updater will verify the Teleport artifact and install Teleport into the /var/lib/teleport directory. | ||
| The Teleport updater must be able to resolve conflict for these two situations: |
There was a problem hiding this comment.
A few questions here:
- For existing installations, would it make sense to run teleport-upgrade install automatically in post-install scripts, given that we can detect the proxy address?
- Should we remove the existing teleport package if it exists, or just ensure that the auto-upgrader version of teleport is executed by the systemd service?
There was a problem hiding this comment.
For existing installations, would it make sense to run teleport-upgrade install automatically in post-install scripts, given that we can detect the proxy address?
I was thinking the install script would run teleport-upgrade install right after installing the updater. Is what you're thinking a bit different? What would be the benefit of running teleport-upgrade install automatically in a post-install script?
Should we remove the existing teleport package if it exists, or just ensure that the auto-upgrader version of teleport is executed by the systemd service?
Hmm, I'm thinking the second option would be more resilient? I'm thinking if the teleport package is still available for installation, there is probably a good chance that users reinstall the package unintentionally?
| The agent version should be made easier to configure using the Kubernetes or Cloud API. Modifying the agent version should not require reconciliation to be paused, and it should not require the Teleport proxy to be redeployed. | ||
| A simple solution is to configure the Teleport proxies to now read the agent version from a monitored file on disk. Teleport Cloud will be able to easily and dynamically modify the agent version via the Kubernetes API. |
There was a problem hiding this comment.
Chatting with @russjones, we should centralize this configuration to a Teleport resource. For Cloud, we can schedule work to allow our controllers to make the auth changes directly.
There was a problem hiding this comment.
Are we thinking we add configuration to this autoupdate_version resource proposed in the client tools rfd?
```yaml
kind: autoupdate_version
spec:
  # tools_version is the version of client tools the cluster will advertise.
  # Can be auto (match the version of the proxy) or an exact semver formatted
  # version.
  tools_version: auto|X.Y.Z
```
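As a small illustration of the `auto|X.Y.Z` semantics in that comment, the Go sketch below resolves the advertised version. The function name `resolveToolsVersion` is hypothetical, and the check accepts only plain `X.Y.Z` (no pre-release or build suffixes), which is a simplifying assumption rather than what the client tools RFD specifies.

```go
package main

import (
	"fmt"
	"regexp"
)

// plainSemver matches X.Y.Z without pre-release or build metadata.
// This is a simplification for the sketch.
var plainSemver = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$`)

// resolveToolsVersion returns the version the cluster would advertise:
// the proxy's own version for "auto", or the pinned version otherwise.
func resolveToolsVersion(toolsVersion, proxyVersion string) (string, error) {
	v := toolsVersion
	if v == "auto" {
		v = proxyVersion
	}
	if !plainSemver.MatchString(v) {
		return "", fmt.Errorf("invalid version %q: want auto or X.Y.Z", v)
	}
	return v, nil
}

func main() {
	fmt.Println(resolveToolsVersion("auto", "15.1.4"))
	fmt.Println(resolveToolsVersion("15.2.0", "15.1.4"))
	fmt.Println(resolveToolsVersion("latest", "15.1.4"))
}
```

An agent version field added to the same resource could reuse the same resolution rule.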
| Step 4: The Teleport documentation should be updated to include a new section with instructions about how a user can build their own update automation. | ||
| ### Agent Version Management |
There was a problem hiding this comment.
Can you add the proposed API for this?
Seems like we need at least three fields:
- Version
- Upgrade time OR immediate (can be a specific timestamp -- Cloud can translate window -> time)
- Jitter duration
And out-of-scope for this RFD, but eventually:
- Client bucket ID, for staged rollouts (upgrader holds off until the ID matches locally configured value)
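One way to picture the three in-scope fields is the Go sketch below. The type `AgentUpdatePlan`, its field names, and the `NextStart` method are hypothetical; it assumes a zero start time means "immediate" and that each agent adds a random offset within the jitter duration. The out-of-scope bucket ID is left out.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// AgentUpdatePlan is a hypothetical shape for the requested API.
type AgentUpdatePlan struct {
	Version   string        // target agent version
	StartTime time.Time     // zero value means "update immediately"
	Jitter    time.Duration // maximum random delay added per agent
}

// NextStart returns when a given agent should begin updating.
// frac is a per-agent random value in [0, 1), for example rand.Float64().
func (p AgentUpdatePlan) NextStart(now time.Time, frac float64) time.Time {
	start := p.StartTime
	if start.IsZero() {
		start = now
	}
	if p.Jitter <= 0 {
		return start
	}
	if frac < 0 {
		frac = 0
	} else if frac > 1 {
		frac = 1
	}
	return start.Add(time.Duration(frac * float64(p.Jitter)))
}

func main() {
	now := time.Now()
	plan := AgentUpdatePlan{
		Version:   "15.1.4",
		StartTime: now.Add(2 * time.Hour), // Cloud could translate a window into this timestamp
		Jitter:    30 * time.Minute,
	}
	start := plan.NextStart(now, rand.Float64())
	fmt.Printf("update to %s in %s\n", plan.Version, start.Sub(now).Round(time.Second))
}
```

Spreading agents across the jitter interval is what would keep thousands of agents from downloading at the same instant.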
Superseded by #39805. @bernardjkim closing this for now, but feel free to reopen if you have any concerns.
This RFD proposes a change to the automatic updates design. The design has a number of limitations that are incompatible with the needs of Teleport Cloud. This RFD provides an overview of the current issues and some potential solutions. The RFD includes minimal implementation details. A separate execution plan with more implementation details will be created if the change proposals are approved.
Related Issues: