Skip to content

Conversation

@smarterclayton
Copy link
Contributor

@smarterclayton smarterclayton commented Apr 20, 2019

Implement:

  1. A payload can contain a root of trust config map that describes GPG public keys and one or more sources of signatures
  2. A verifier that checks the root of trust config map if present in the payload and fails if a signature cannot be found for all public keys from any of the stores
  3. Change the CVO to default to "don't allow unverified images", block update to images that haven't been verified, and allow an admin to override that with a field in desiredUpdate allowUnverifiedImages
  4. Report unverified errors in CVO status as well as the verify bit on the history

In CI, the payload would have a signing key public key mixed in to the payload (OKD, OCP CI). In OCP, we would mix in an official signing public key and the location of one or more stores to read from.

Questions identified during prototyping:

  1. The sync loop does nothing when desiredUpdate is failing to retrieve a payload - we need to "under sync" in that case (continue to sync whatever the last payload was) until the payload is retrieved
  2. We need to allow additional stores to be specified on a cluster that define where else source can be located from. That could be a second config map we mix in that is create only, or it could be global config.
  3. What failure modes are not covered here? The upstream stores going down can be handled by background retry and allowing override.
  4. Update isn't calling oc adm release info --verify, which it should (prevents the payload from being unreachable or inconsistent). However, we'd have to wire through the pull secret into the container that runs oc.
  5. Is this too paranoid? Escape hatch of --force makes me feel better, especially when we do automated updates. I would rather have all of our testing using a signature rather than only what end users see.
  6. Use something besides atomic container image? Would prefer not to because a) it's reasonably well reviewed, b) there is already red hat detached signature infra that can publish signatures (we have to talk with RCM more).

TODO:

  • Clients have to change (oc adm upgrade and web console)
  • Get the CI release signer and verifier up, and public signatures to GCS or other public location
  • Additional testing
  • Decide whether the payload should consider its own payload (the innate one) as verified implicitly. Answer is probably not.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2019
@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 20, 2019
@smarterclayton smarterclayton force-pushed the sign_2 branch 2 times, most recently from 69d04cc to 780e451 Compare April 20, 2019 17:46
@smarterclayton smarterclayton changed the title DO NOT MERGE: Prototype signing the release payload DO NOT MERGE: Verify a GPG signature on the release image digest from a root of trust within the payload Apr 20, 2019
@smarterclayton
Copy link
Contributor Author

/test e2e-aws

@smarterclayton smarterclayton force-pushed the sign_2 branch 2 times, most recently from cf9f6de to ad61a26 Compare April 22, 2019 22:13
@smarterclayton smarterclayton force-pushed the sign_2 branch 2 times, most recently from 7227456 to 83c0ffb Compare April 23, 2019 02:23
@smarterclayton
Copy link
Contributor Author

Simplified down to fix two problems:

  1. Don't try to reverify unverified payloads - the fact that local doesn't need to be verified complicates that, and in the future we can implement caching and other niceties around the verifier and have it be invoked before the retriever
  2. Remove the negation of UnverifiedImage and weirdness of the signature of RetrievePayload by having it return a struct with the data needed by the caller.

Added another cvo_scenario test that showed different logic.

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton
Copy link
Contributor Author

Are we ok with API changes? If so #293 needs to be merged before this and I need to bump the API.

@smarterclayton
Copy link
Contributor Author

WTH

/retest

@abhinavdahiya
Copy link
Contributor

WTH

/retest

Looks like there is a merge conflict @smarterclayton

All operators should use the message policy fallback to grab a final
error message, if necessary.
Only bring '-v' in because that one we care about.
Use the existing Atomic image signing protocol to read detached signatures
for images by digest from a remote location so that release images can
be verified before they are executed.

Add a Verifier interface to the CVO that abstracts checking for verified
updates. On start, check the payload for a config map with the annotation
release.openshift.io/verification-config-map set (value is ignored) and
load the set of all public keys that must be verified along with the http
or file store locations for detached signatures. Every key must be
verified to accept the payload. A subsequent commit will leverage the
Verifier to block downloading a new release image.
The verifier is loaded at startup and passed to the payload retriever,
which then delegates to the verifier and returns an error if the
digest is not authentic. If no verifier is configured, the default
verifier is to reject the payload as inauthentic.

Move the initialization of the sync worker to after the verifier is
loaded so that we can pass the verifier into the payload retriever,
which makes the New() method slightly smaller and separates out
payload related actions from controller related actions.
For safety, all unverified images are considered untrusted and the
CVO should not apply them. An admin may override this safe default
with force (which would be mapped to oc adm upgrade --force). When
this happens, we should communicate clearly which updates were
verified and which were not.

The CVO, unless configured otherwise, assumes everything is unverified.
The payload retriever uses the verifier to check the incoming image -
both tags or failed verification is reported to the user. If the admin
then sets allowUnverifiedImages, the sync loop continues.

Periodically in the background recheck any reconciling payload that
is unverified against the verifier, just in case it was a temporary
state.
@abhinavdahiya
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@abhinavdahiya
Copy link
Contributor

/retest

@openshift-ci-robot
Copy link
Contributor

@smarterclayton: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws f8eff25 link /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Copy link
Contributor Author

Known flake, force merging.

@smarterclayton smarterclayton merged commit 398afc1 into openshift:master Apr 26, 2019
wking added a commit to wking/oc that referenced this pull request Apr 16, 2020
Adding a client-side guard like the server-side guard [1] from
openshift/cluster-version-operator@55e3cb450f (verifier: Add public
key verification of release image digests, 2019-04-19,
openshift/cluster-version-operator#170).  This isn't that big a deal,
because the client-side guard exists, but it does save users the
effort of pushing a by-tag pullspec into ClusterVersion and then
having to circle back around to notice that the cluster-version
operator is complaining about the verification failure.

The warning on --force is because Clayton considers blocking this flow
completely to be a breaking API change, and also considers that
failing those folks early is more invasive than letting them continue
to slip through here and have some subset of them potentially blow up
later if:

1. You ask to update to a by-tag pullspec.
2. Cluster updates.
3. Someone clobbers the tag you used in the registry to point it at a
   different release.
3. Cluster continues on, blissfully unaware.
4. CVO gets rescheduled for whatever reason.
5. Cluster pulls a fresh registry image for the new pod, but it's
   by-tag, so you get the new content.
6. CVO thinks it's still in reconciling mode, because the pullspec
   hasn't changed.
7. World explodes as the new manifests get applied in a parallel,
   randomized order.

Or maybe the new CVO defaults to starting in install mode or some
such.  But still, random release rollouts trigged by CVO pod restarts
are more exitement than I'd wish on anyone, even folks who use
--force.

It would also be acceptable to have oc attempt to resolve the by-tag
pullspec into a by-digest pullspec, but there's no guarantee that the
host running 'oc' has access to the same registry that the in-cluster
CRI-O will be pulling from, so getting consistent client-side
resolution would be tricky.

[1]: https://github.com/openshift/cluster-version-operator/blame/89cb270523675e27a2f54918431170946636f5d5/pkg/verify/verify.go#L203-L204
wking added a commit to wking/oc that referenced this pull request Apr 16, 2020
Adding a client-side guard like the server-side guard [1] from
openshift/cluster-version-operator@55e3cb450f (verifier: Add public
key verification of release image digests, 2019-04-19,
openshift/cluster-version-operator#170).  This isn't that big a deal,
because the client-side guard exists, but it does save users the
effort of pushing a by-tag pullspec into ClusterVersion and then
having to circle back around to notice that the cluster-version
operator is complaining about the verification failure.

The warning on --force (instead of a hard failure) is because Clayton
considers blocking this flow completely to be a breaking API change,
and also considers that failing those folks early is more invasive
than letting them continue to slip through here and have some subset
of them potentially blow up later if:

1. You ask to update to a by-tag pullspec.
2. Cluster updates.
3. Someone clobbers the tag you used in the registry to point it at a
   different release.
4. Cluster continues on, blissfully unaware.
5. CVO gets rescheduled for whatever reason.
6. Cluster pulls a fresh registry image for the new pod, but it's
   by-tag, so you get the new content.
7. CVO thinks it's still in reconciling mode, because the pullspec
   hasn't changed.
8. World explodes as the new manifests get applied in a parallel,
   randomized order.

Or maybe the new CVO defaults to starting in install mode or some
such.  But still, random release rollouts trigged by CVO pod restarts
are more exitement than I'd wish on anyone, even folks who use
--force.

It would also be acceptable to have oc attempt to resolve the by-tag
pullspec into a by-digest pullspec, but there's no guarantee that the
host running 'oc' has access to the same registry that the in-cluster
CRI-O will be pulling from, so getting consistent client-side
resolution would be tricky.

[1]: https://github.com/openshift/cluster-version-operator/blame/89cb270523675e27a2f54918431170946636f5d5/pkg/verify/verify.go#L203-L204
wking added a commit to wking/oc that referenced this pull request May 5, 2020
Adding a client-side guard like the server-side guard [1] from
openshift/cluster-version-operator@55e3cb450f (verifier: Add public
key verification of release image digests, 2019-04-19,
openshift/cluster-version-operator#170).  This isn't that big a deal,
because the client-side guard exists, but it does save users the
effort of pushing a by-tag pullspec into ClusterVersion and then
having to circle back around to notice that the cluster-version
operator is complaining about the verification failure.

The warning on --force (instead of a hard failure) is because Clayton
considers blocking this flow completely to be a breaking API change,
and also considers that failing those folks early is more invasive
than letting them continue to slip through here and have some subset
of them potentially blow up later if:

1. You ask to update to a by-tag pullspec.
2. Cluster updates.
3. Someone clobbers the tag you used in the registry to point it at a
   different release.
4. Cluster continues on, blissfully unaware.
5. CVO gets rescheduled for whatever reason.
6. Cluster pulls a fresh registry image for the new pod, but it's
   by-tag, so you get the new content.
7. CVO thinks it's still in reconciling mode, because the pullspec
   hasn't changed.
8. World explodes as the new manifests get applied in a parallel,
   randomized order.

Or maybe the new CVO defaults to starting in install mode or some
such.  But still, random release rollouts trigged by CVO pod restarts
are more exitement than I'd wish on anyone, even folks who use
--force.

It would also be acceptable to have oc attempt to resolve the by-tag
pullspec into a by-digest pullspec, but there's no guarantee that the
host running 'oc' has access to the same registry that the in-cluster
CRI-O will be pulling from, so getting consistent client-side
resolution would be tricky.

[1]: https://github.com/openshift/cluster-version-operator/blame/89cb270523675e27a2f54918431170946636f5d5/pkg/verify/verify.go#L203-L204
LalatenduMohanty added a commit to LalatenduMohanty/cluster-version-operator that referenced this pull request Oct 15, 2020
Commit f8eff25 (PR openshift#170) missed to update the doc string

Signed-off-by: Lalatendu Mohanty <[email protected]>
wking added a commit to wking/openshift-api that referenced this pull request Apr 7, 2021
These docs have not been adjusted since the properties landed in
ab4ff93 (Update ClusterVersion to have a 'force' update flag and
track verified, 2019-04-22, openshift#293).  The initial cluster-version
operator implementation only used them for signature checks,
openshift/cluster-version-operator@f8eff25191 (sync: Report whether
updates are verified and allow admin override, 2019-04-20,
openshift/cluster-version-operator#170).  At that time,
RetrievePayload would always attempt to verify the proposed release
image, including both signature validation and "does this look like a
real release image?" checks.  If that verification succeeded, it would
set 'verified: true'.  If that verification failed, it would look at
'force', and if 'force' was true, it would set 'verified: false' and
proceed with the update, while if 'force' was false, it would reject
the update.

Despite the API docs, availableUpdates never came into this in the CVO
code.  The only availableUpdates guards are client-side in 'oc', and
the API docs are discussion the ClusterVersion API, not 'oc adm
upgrade's --force API.  In openshift/oc@0501d04ff1 (upgrade: Separate
flags for safety instead of abusing force, 2019-09-27,
openshift/oc#109), oc's client-side guards split
--allow-explicit-upgrade (for updating to something not in
availableUpdates) and --allow-upgrade-with-warnings (for updating
despite client-side guards against Progressing and Degraded
ClusterVersion conditions) away from --force, but again, never part of
the ClusterVersion force property's CVO-side handling.

Since then, the CVO grew precondition handlers in
openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition
handlers and perform these checks before accepting Update, 2019-08-26,
openshift/cluster-version-operator#243).  syncOnce ran all the
precondition checks, which at that point was just the Upgradeable
ClusterOperator and ClusterVersion condition check.  If that check
failed, but 'force' was true, it would proceed with the update, while
if 'force' was false, it would reject the update.  In neither case
would 'verified' be altered, it continued to only track the signature
and "does this look like a real release image?" checks.  The idea at
the time was that these would fall under the "normal protections"
phrasing (overriden by force) and not the "other forms of consistency
checking" phrasing (whatever those were supposed to be).
wking added a commit to wking/openshift-api that referenced this pull request Apr 7, 2021
These docs have not been adjusted since the properties landed in
ab4ff93 (Update ClusterVersion to have a 'force' update flag and
track verified, 2019-04-22, openshift#293).  The initial cluster-version
operator implementation only used them for signature checks,
openshift/cluster-version-operator@f8eff25191 (sync: Report whether
updates are verified and allow admin override, 2019-04-20,
openshift/cluster-version-operator#170).  At that time,
RetrievePayload would always attempt to verify the proposed release
image, including both signature validation and "does this look like a
real release image?" checks.  If that verification succeeded, it would
set 'verified: true'.  If that verification failed, it would look at
'force', and if 'force' was true, it would set 'verified: false' and
proceed with the update, while if 'force' was false, it would reject
the update.

Despite the API docs, availableUpdates never came into this in the CVO
code.  The only availableUpdates guards are client-side in 'oc', and
the API docs are discussion the ClusterVersion API, not 'oc adm
upgrade's --force API.  In openshift/oc@0501d04ff1 (upgrade: Separate
flags for safety instead of abusing force, 2019-09-27,
openshift/oc#109), oc's client-side guards split
--allow-explicit-upgrade (for updating to something not in
availableUpdates) and --allow-upgrade-with-warnings (for updating
despite client-side guards against Progressing and Degraded
ClusterVersion conditions) away from --force, but again, never part of
the ClusterVersion force property's CVO-side handling.

Since then, the CVO grew precondition handlers in
openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition
handlers and perform these checks before accepting Update, 2019-08-26,
openshift/cluster-version-operator#243).  syncOnce ran all the
precondition checks, which at that point was just the Upgradeable
ClusterOperator and ClusterVersion condition check.  If that check
failed, but 'force' was true, it would proceed with the update, while
if 'force' was false, it would reject the update.  In neither case
would 'verified' be altered; it continued to only track the signature
and "does this look like a real release image?" checks.  The idea at
the time was that these would fall under the "normal protections"
phrasing (overriden by force) and not the "other forms of consistency
checking" phrasing (whatever those were supposed to be).
wking added a commit to wking/openshift-api that referenced this pull request Jul 14, 2021
These docs have not been adjusted since the properties landed in
ab4ff93 (Update ClusterVersion to have a 'force' update flag and
track verified, 2019-04-22, openshift#293).  The initial cluster-version
operator implementation only used them for signature checks,
openshift/cluster-version-operator@f8eff25191 (sync: Report whether
updates are verified and allow admin override, 2019-04-20,
openshift/cluster-version-operator#170).  At that time,
RetrievePayload would always attempt to verify the proposed release
image, including both signature validation and "does this look like a
real release image?" checks.  If that verification succeeded, it would
set 'verified: true'.  If that verification failed, it would look at
'force', and if 'force' was true, it would set 'verified: false' and
proceed with the update, while if 'force' was false, it would reject
the update.

Despite the API docs, availableUpdates never came into this in the CVO
code.  The only availableUpdates guards are client-side in 'oc', and
the API docs are discussion the ClusterVersion API, not 'oc adm
upgrade's --force API.  In openshift/oc@0501d04ff1 (upgrade: Separate
flags for safety instead of abusing force, 2019-09-27,
openshift/oc#109), oc's client-side guards split
--allow-explicit-upgrade (for updating to something not in
availableUpdates) and --allow-upgrade-with-warnings (for updating
despite client-side guards against Progressing and Degraded
ClusterVersion conditions) away from --force, but again, never part of
the ClusterVersion force property's CVO-side handling.

Since then, the CVO grew precondition handlers in
openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition
handlers and perform these checks before accepting Update, 2019-08-26,
openshift/cluster-version-operator#243).  syncOnce ran all the
precondition checks, which at that point was just the Upgradeable
ClusterOperator and ClusterVersion condition check.  If that check
failed, but 'force' was true, it would proceed with the update, while
if 'force' was false, it would reject the update.  In neither case
would 'verified' be altered; it continued to only track the signature
and "does this look like a real release image?" checks.  The idea at
the time was that these would fall under the "normal protections"
phrasing (overriden by force) and not the "other forms of consistency
checking" phrasing (whatever those were supposed to be).
wking added a commit to wking/openshift-api that referenced this pull request Jul 14, 2021
These docs have not been adjusted since the properties landed in
ab4ff93 (Update ClusterVersion to have a 'force' update flag and
track verified, 2019-04-22, openshift#293).  The initial cluster-version
operator implementation only used them for signature checks,
openshift/cluster-version-operator@f8eff25191 (sync: Report whether
updates are verified and allow admin override, 2019-04-20,
openshift/cluster-version-operator#170).  At that time,
RetrievePayload would always attempt to verify the proposed release
image, including both signature validation and "does this look like a
real release image?" checks.  If that verification succeeded, it would
set 'verified: true'.  If that verification failed, it would look at
'force', and if 'force' was true, it would set 'verified: false' and
proceed with the update, while if 'force' was false, it would reject
the update.

Despite the API docs, availableUpdates never came into this in the CVO
code.  The only availableUpdates guards are client-side in 'oc', and
the API docs are discussion the ClusterVersion API, not 'oc adm
upgrade's --force API.  In openshift/oc@0501d04ff1 (upgrade: Separate
flags for safety instead of abusing force, 2019-09-27,
openshift/oc#109), oc's client-side guards split
--allow-explicit-upgrade (for updating to something not in
availableUpdates) and --allow-upgrade-with-warnings (for updating
despite client-side guards against Progressing and Degraded
ClusterVersion conditions) away from --force, but again, never part of
the ClusterVersion force property's CVO-side handling.

Since then, the CVO grew precondition handlers in
openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition
handlers and perform these checks before accepting Update, 2019-08-26,
openshift/cluster-version-operator#243).  syncOnce ran all the
precondition checks, which at that point was just the Upgradeable
ClusterOperator and ClusterVersion condition check.  If that check
failed, but 'force' was true, it would proceed with the update, while
if 'force' was false, it would reject the update.  In neither case
would 'verified' be altered; it continued to only track the signature
and "does this look like a real release image?" checks.  The idea at
the time was that these would fall under the "normal protections"
phrasing (overriden by force) and not the "other forms of consistency
checking" phrasing (whatever those were supposed to be).
wking added a commit to wking/openshift-api that referenced this pull request Jul 14, 2021
These docs have not been adjusted since the properties landed in
ab4ff93 (Update ClusterVersion to have a 'force' update flag and
track verified, 2019-04-22, openshift#293).  The initial cluster-version
operator implementation only used them for signature checks,
openshift/cluster-version-operator@f8eff25191 (sync: Report whether
updates are verified and allow admin override, 2019-04-20,
openshift/cluster-version-operator#170).  At that time,
RetrievePayload would always attempt to verify the proposed release
image, including both signature validation and "does this look like a
real release image?" checks.  If that verification succeeded, it would
set 'verified: true'.  If that verification failed, it would look at
'force', and if 'force' was true, it would set 'verified: false' and
proceed with the update, while if 'force' was false, it would reject
the update.

Despite the API docs, availableUpdates never came into this in the CVO
code.  The only availableUpdates guards are client-side in 'oc', and
the API docs are discussion the ClusterVersion API, not 'oc adm
upgrade's --force API.  In openshift/oc@0501d04ff1 (upgrade: Separate
flags for safety instead of abusing force, 2019-09-27,
openshift/oc#109), oc's client-side guards split
--allow-explicit-upgrade (for updating to something not in
availableUpdates) and --allow-upgrade-with-warnings (for updating
despite client-side guards against Progressing and Degraded
ClusterVersion conditions) away from --force, but again, never part of
the ClusterVersion force property's CVO-side handling.

Since then, the CVO grew precondition handlers in
openshift/cluster-version-operator@aceb5bc66f (pkg: add precondition
handlers and perform these checks before accepting Update, 2019-08-26,
openshift/cluster-version-operator#243).  syncOnce ran all the
precondition checks, which at that point was just the Upgradeable
ClusterOperator and ClusterVersion condition check.  If that check
failed, but 'force' was true, it would proceed with the update, while
if 'force' was false, it would reject the update.  In neither case
would 'verified' be altered; it continued to only track the signature
and "does this look like a real release image?" checks.  The idea at
the time was that these would fall under the "normal protections"
phrasing (overriden by force) and not the "other forms of consistency
checking" phrasing (whatever those were supposed to be).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants