Verify a GPG signature on the release image digest from a root of trust within the payload #170

smarterclayton · 2019-04-20T04:32:04Z

Implement:

A payload can contain a root of trust config map that describes GPG public keys and one or more sources of signatures
A verifier that checks the root of trust config map if present in the payload and fails if a signature cannot be found for all public keys from any of the stores
Change the CVO to default to "don't allow unverified images", block update to images that haven't been verified, and allow an admin to override that with a field in desiredUpdate allowUnverifiedImages
Report unverified errors in CVO status as well as the verify bit on the history

In CI, the payload would have a signing key public key mixed in to the payload (OKD, OCP CI). In OCP, we would mix in an official signing public key and the location of one or more stores to read from.

Questions identified during prototyping:

The sync loop does nothing when desiredUpdate is failing to retrieve a payload - we need to "under sync" in that case (continue to sync whatever the last payload was) until the payload is retrieved
We need to allow additional stores to be specified on a cluster that define where else source can be located from. That could be a second config map we mix in that is create only, or it could be global config.
What failure modes are not covered here? The upstream stores going down can be handled by background retry and allowing override.
Update isn't calling oc adm release info --verify, which it should (prevents the payload from being unreachable or inconsistent). However, we'd have to wire through the pull secret into the container that runs oc.
Is this too paranoid? Escape hatch of --force makes me feel better, especially when we do automated updates. I would rather have all of our testing using a signature rather than only what end users see.
Use something besides atomic container image? Would prefer not to because a) it's reasonably well reviewed, b) there is already red hat detached signature infra that can publish signatures (we have to talk with RCM more).

TODO:

Clients have to change (oc adm upgrade and web console)
Get the CI release signer and verifier up, and public signatures to GCS or other public location
Additional testing
Decide whether the payload should consider its own payload (the innate one) as verified implicitly. Answer is probably not.

smarterclayton · 2019-04-21T19:38:56Z

/test e2e-aws

pkg/cvo/cvo.go

pkg/verify/verify.go

pkg/cvo/cvo.go

smarterclayton · 2019-04-23T02:24:52Z

Simplified down to fix two problems:

Don't try to reverify unverified payloads - the fact that local doesn't need to be verified complicates that, and in the future we can implement caching and other niceties around the verifier and have it be invoked before the retriever
Remove the negation of UnverifiedImage and weirdness of the signature of RetrievePayload by having it return a struct with the data needed by the caller.

Added another cvo_scenario test that showed different logic.

smarterclayton · 2019-04-23T02:33:34Z

/retest

smarterclayton · 2019-04-23T16:07:18Z

/retest

pkg/cvo/status.go

smarterclayton · 2019-04-23T17:31:07Z

Are we ok with API changes? If so #293 needs to be merged before this and I need to bump the API.

smarterclayton · 2019-04-26T13:55:29Z

WTH

/retest

abhinavdahiya · 2019-04-26T13:59:19Z

WTH

/retest

Looks like there is a merge conflict @smarterclayton

All operators should use the message policy fallback to grab a final error message, if necessary.

Only bring '-v' in because that one we care about.

Use the existing Atomic image signing protocol to read detached signatures for images by digest from a remote location so that release images can be verified before they are executed. Add a Verifier interface to the CVO that abstracts checking for verified updates. On start, check the payload for a config map with the annotation release.openshift.io/verification-config-map set (value is ignored) and load the set of all public keys that must be verified along with the http or file store locations for detached signatures. Every key must be verified to accept the payload. A subsequent commit will leverage the Verifier to block downloading a new release image.

The verifier is loaded at startup and passed to the payload retriever, which then delegates to the verifier and returns an error if the digest is not authentic. If no verifier is configured, the default verifier is to reject the payload as inauthentic. Move the initialization of the sync worker to after the verifier is loaded so that we can pass the verifier into the payload retriever, which makes the New() method slightly smaller and separates out payload related actions from controller related actions.

For safety, all unverified images are considered untrusted and the CVO should not apply them. An admin may override this safe default with force (which would be mapped to oc adm upgrade --force). When this happens, we should communicate clearly which updates were verified and which were not. The CVO, unless configured otherwise, assumes everything is unverified. The payload retriever uses the verifier to check the incoming image - both tags or failed verification is reported to the user. If the admin then sets allowUnverifiedImages, the sync loop continues. Periodically in the background recheck any reconciling payload that is unverified against the verifier, just in case it was a temporary state.

abhinavdahiya · 2019-04-26T15:20:07Z

/lgtm

openshift-ci-robot · 2019-04-26T15:20:12Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

abhinavdahiya · 2019-04-26T16:20:55Z

/retest

openshift-ci-robot · 2019-04-26T17:44:25Z

@smarterclayton: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
ci/prow/e2e-aws	`f8eff25`	link	`/test e2e-aws`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

smarterclayton · 2019-04-26T18:08:50Z

Known flake, force merging.

Adding a client-side guard like the server-side guard [1] from openshift/cluster-version-operator@55e3cb450f (verifier: Add public key verification of release image digests, 2019-04-19, openshift/cluster-version-operator#170). This isn't that big a deal, because the client-side guard exists, but it does save users the effort of pushing a by-tag pullspec into ClusterVersion and then having to circle back around to notice that the cluster-version operator is complaining about the verification failure. The warning on --force is because Clayton considers blocking this flow completely to be a breaking API change, and also considers that failing those folks early is more invasive than letting them continue to slip through here and have some subset of them potentially blow up later if: 1. You ask to update to a by-tag pullspec. 2. Cluster updates. 3. Someone clobbers the tag you used in the registry to point it at a different release. 3. Cluster continues on, blissfully unaware. 4. CVO gets rescheduled for whatever reason. 5. Cluster pulls a fresh registry image for the new pod, but it's by-tag, so you get the new content. 6. CVO thinks it's still in reconciling mode, because the pullspec hasn't changed. 7. World explodes as the new manifests get applied in a parallel, randomized order. Or maybe the new CVO defaults to starting in install mode or some such. But still, random release rollouts trigged by CVO pod restarts are more exitement than I'd wish on anyone, even folks who use --force. It would also be acceptable to have oc attempt to resolve the by-tag pullspec into a by-digest pullspec, but there's no guarantee that the host running 'oc' has access to the same registry that the in-cluster CRI-O will be pulling from, so getting consistent client-side resolution would be tricky. [1]: https://github.com/openshift/cluster-version-operator/blame/89cb270523675e27a2f54918431170946636f5d5/pkg/verify/verify.go#L203-L204

Adding a client-side guard like the server-side guard [1] from openshift/cluster-version-operator@55e3cb450f (verifier: Add public key verification of release image digests, 2019-04-19, openshift/cluster-version-operator#170). This isn't that big a deal, because the client-side guard exists, but it does save users the effort of pushing a by-tag pullspec into ClusterVersion and then having to circle back around to notice that the cluster-version operator is complaining about the verification failure. The warning on --force (instead of a hard failure) is because Clayton considers blocking this flow completely to be a breaking API change, and also considers that failing those folks early is more invasive than letting them continue to slip through here and have some subset of them potentially blow up later if: 1. You ask to update to a by-tag pullspec. 2. Cluster updates. 3. Someone clobbers the tag you used in the registry to point it at a different release. 4. Cluster continues on, blissfully unaware. 5. CVO gets rescheduled for whatever reason. 6. Cluster pulls a fresh registry image for the new pod, but it's by-tag, so you get the new content. 7. CVO thinks it's still in reconciling mode, because the pullspec hasn't changed. 8. World explodes as the new manifests get applied in a parallel, randomized order. Or maybe the new CVO defaults to starting in install mode or some such. But still, random release rollouts trigged by CVO pod restarts are more exitement than I'd wish on anyone, even folks who use --force. It would also be acceptable to have oc attempt to resolve the by-tag pullspec into a by-digest pullspec, but there's no guarantee that the host running 'oc' has access to the same registry that the in-cluster CRI-O will be pulling from, so getting consistent client-side resolution would be tricky. [1]: https://github.com/openshift/cluster-version-operator/blame/89cb270523675e27a2f54918431170946636f5d5/pkg/verify/verify.go#L203-L204

Commit f8eff25 (PR openshift#170) missed to update the doc string Signed-off-by: Lalatendu Mohanty <[email protected]>

These docs have not been adjusted since the properties landed in ab4ff93 (Update ClusterVersion to have a 'force' update flag and track verified, 2019-04-22, openshift#293). The initial cluster-version operator implementation only used them for signature checks, openshift/cluster-version-operator@f8eff25191 (sync: Report whether updates are verified and allow admin override, 2019-04-20, openshift/cluster-version-operator#170). At that time, RetrievePayload would always attempt to verify the proposed release image, including both signature validation and "does this look like a real release image?" checks. If that verification succeeded, it would set 'verified: true'. If that verification failed, it would look at 'force', and if 'force' was true, it would set 'verified: false' and proceed with the update, while if 'force' was false, it would reject the update. Despite the API docs, availableUpdates never came into this in the CVO code. The only availableUpdates guards are client-side in 'oc', and the API docs are discussion the ClusterVersion API, not 'oc adm upgrade's --force API. In openshift/oc@0501d04ff1 (upgrade: Separate flags for safety instead of abusing force, 2019-09-27, openshift/oc#109), oc's client-side guards split --allow-explicit-upgrade (for updating to something not in availableUpdates) and --allow-upgrade-with-warnings (for updating despite client-side guards against Progressing and Degraded ClusterVersion conditions) away from --force, but again, never part of the ClusterVersion force property's CVO-side handling. Since then, the CVO grew precondition handlers in openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition handlers and perform these checks before accepting Update, 2019-08-26, openshift/cluster-version-operator#243). syncOnce ran all the precondition checks, which at that point was just the Upgradeable ClusterOperator and ClusterVersion condition check. If that check failed, but 'force' was true, it would proceed with the update, while if 'force' was false, it would reject the update. In neither case would 'verified' be altered, it continued to only track the signature and "does this look like a real release image?" checks. The idea at the time was that these would fall under the "normal protections" phrasing (overriden by force) and not the "other forms of consistency checking" phrasing (whatever those were supposed to be).

These docs have not been adjusted since the properties landed in ab4ff93 (Update ClusterVersion to have a 'force' update flag and track verified, 2019-04-22, openshift#293). The initial cluster-version operator implementation only used them for signature checks, openshift/cluster-version-operator@f8eff25191 (sync: Report whether updates are verified and allow admin override, 2019-04-20, openshift/cluster-version-operator#170). At that time, RetrievePayload would always attempt to verify the proposed release image, including both signature validation and "does this look like a real release image?" checks. If that verification succeeded, it would set 'verified: true'. If that verification failed, it would look at 'force', and if 'force' was true, it would set 'verified: false' and proceed with the update, while if 'force' was false, it would reject the update. Despite the API docs, availableUpdates never came into this in the CVO code. The only availableUpdates guards are client-side in 'oc', and the API docs are discussion the ClusterVersion API, not 'oc adm upgrade's --force API. In openshift/oc@0501d04ff1 (upgrade: Separate flags for safety instead of abusing force, 2019-09-27, openshift/oc#109), oc's client-side guards split --allow-explicit-upgrade (for updating to something not in availableUpdates) and --allow-upgrade-with-warnings (for updating despite client-side guards against Progressing and Degraded ClusterVersion conditions) away from --force, but again, never part of the ClusterVersion force property's CVO-side handling. Since then, the CVO grew precondition handlers in openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition handlers and perform these checks before accepting Update, 2019-08-26, openshift/cluster-version-operator#243). syncOnce ran all the precondition checks, which at that point was just the Upgradeable ClusterOperator and ClusterVersion condition check. If that check failed, but 'force' was true, it would proceed with the update, while if 'force' was false, it would reject the update. In neither case would 'verified' be altered; it continued to only track the signature and "does this look like a real release image?" checks. The idea at the time was that these would fall under the "normal protections" phrasing (overriden by force) and not the "other forms of consistency checking" phrasing (whatever those were supposed to be).

openshift-ci-robot requested a review from abhinavdahiya April 20, 2019 04:32

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2019

openshift-ci-robot requested a review from crawford April 20, 2019 04:32

openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 20, 2019

smarterclayton force-pushed the sign_2 branch 2 times, most recently from 69d04cc to 780e451 Compare April 20, 2019 17:46

smarterclayton changed the title ~~DO NOT MERGE: Prototype signing the release payload~~ DO NOT MERGE: Verify a GPG signature on the release image digest from a root of trust within the payload Apr 20, 2019

smarterclayton force-pushed the sign_2 branch from e12d69d to c13ffc9 Compare April 20, 2019 21:05

smarterclayton force-pushed the sign_2 branch 2 times, most recently from cf9f6de to ad61a26 Compare April 22, 2019 22:13