
Conversation

@wking (Member) commented Feb 9, 2024

Some clusters do not exercise the internal image registry, and we do not want to bother those admins about this risk. Look at imageregistry_request_duration_seconds_count (which only records registry activity like BlobStore.Create) as a sign that the registry is actively being used to handle images. Compare with imageregistry_http_request_duration_seconds_count, which the registry serves even when it is not actively being used to manage images, as a sign that registry metrics are working. That lets us distinguish between "we know the registry is not being used" and "registry metrics collection is broken, and we're blind to registry use".
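Roughly, that distinction looks like the sketch below (not the shipped matching rule; the group() aggregation and the 0-multiplication are just one way to express the idea):

  # 1 when the registry has recorded image activity (BlobStore.Create and friends)...
    group(imageregistry_request_duration_seconds_count)
  # ...0 when registry metrics are flowing but show no image activity...
  or 0 * group(imageregistry_http_request_duration_seconds_count)
  # ...and no series at all when registry metrics are missing, i.e. we're blind to registry use.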

In 4.14, the registry became an optional component, but it is not optional in 4.13. And it's 4.13 clusters that evaluate this PromQL, because of the 'from: 4[.]13[.].*' matcher. So we don't have to worry about "what if imageregistry_http_request_duration_seconds_count is not available?" cases outside of the "registry metrics collection is broken" case.

The registry does not persist these *_count values across container restarts, so there is a window of exposure like:

  1. Registry activity increments imageregistry_request_duration_seconds_count counters.
  2. PromQL correctly identifies "this cluster is using its registry".
  3. Registry container restarts, zeroing its *_count buckets.
  4. If all registry containers restart before there is new registry activity, the imageregistry_request_duration_seconds_count signal may go away.
  5. New registry activity repopulates imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low, outside of:

  • Updates or other registry-hosting-node reboots, where clusters that actively use the in-cluster registry are likely to have pods consuming images from that registry that need re-pulling, which would drive fresh registry traffic and repopulate imageregistry_request_duration_seconds_count.
  • Registry reconfiguration, which would not drive fresh registry traffic, but which seems like unlikely admin activity in a cluster where the internal registry is not being used.

I'm also using max_over_time with a 1-hour lookback to smear over small durations of "waiting for new registry activity to repopulate imageregistry_request_duration_seconds_count".
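For example, the smoothing amounts to something like the expression below (a sketch; the bare metric selector without additional matchers is illustrative):

  # Take the highest value seen over the past hour, so a brief scrape gap or a
  # just-restarted (zeroed) counter does not immediately drop the "in use" signal.
  max_over_time(imageregistry_request_duration_seconds_count[1h])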

And now that the PromQL is getting pretty complicated, I'm also adjusting label management so it's easier to see from the result which clauses are matching for a particular cluster, in case we need to debug why we're matching when we don't expect to, or not matching when we do expect to.
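One way to do that is to stamp each clause's output with a synthetic label via label_replace; the "reason" label name and value below are made up for illustration, not necessarily what the shipped rule uses:

  # The ".*" regex matches any (or missing) source-label value, so the
  # replacement always applies, and the clause that matched shows up in the
  # result as reason="registry_in_use".
  label_replace(
    group(max_over_time(imageregistry_request_duration_seconds_count[1h])),
    "reason", "registry_in_use", "__name__", ".*"
  )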

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93 (#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

$ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 9, 2024
@openshift-ci-robot commented Feb 9, 2024

@wking: This pull request references IR-461 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.

In response to this: (the PR description quoted above)

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@wking wking force-pushed the AzureRegistryImagePreservation-active-registry-use branch from 251d5ae to f6eef5d on February 9, 2024 18:24
@wking (Member, Author) commented Feb 9, 2024

Testing the new PromQL in 4.13.33's azure-ovn-upgrade-4.13-micro run in PromeCIeus, it notices some registry use late in the update after the nodes rolled:

[screenshots of the PromQL results in PromeCIeus]

@wking (Member, Author) commented Feb 9, 2024

And testing against 4.13.33's aws-sdn-upgrade-4.13-micro run in PromeCIeus, it sees both "no registry activity" and "detected registry activity", but knows it is not exposed in either case because it is AWS and not Azure:

[screenshot of the PromQL result]

…registry use

Some clusters do not exercise the internal image registry, and we do
not want to bother those admins about this risk.  Look at
imageregistry_request_duration_seconds_count (which only records
registry activity like BlobStore.Create) as a sign that the registry
is actively used to handle images.  Compare with
imageregistry_http_request_duration_seconds_count, which the registry
serves even when it is not actively being used to manage images, as a
sign that registry metrics are working, to distinguish between "we
know the registry is not being used" and "registry metrics collection
is broken, and we're blind to registry use".

In 4.14, the registry component became an optional component [1], but
it is not optional in 4.13.  And it's 4.13 clusters evaluating this
PromQL because of the 'from: 4[.]13[.].*'.  So we don't have to worry
about "what if imageregistry_http_request_duration_seconds_count is
not available?" cases outside of the "registry metrics collection is
broken" case.

The registry does not persist these *_count values across container
restarts, so there is a window of exposure like:

1. Registry activity increments
   imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. Registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry
   activity, the imageregistry_request_duration_seconds_count signal
   may go away.
5. New registry activity repopulates
   imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low,
outside of:

* Updates or other registry-hosting-node reboots, where clusters that
  actively use the in-cluster registry are likely to have pods
  consuming images from that registry that need re-pulling, which would
  drive fresh registry traffic and repopulate
  imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry
  traffic, but which seems like unlikely admin activity in a cluster
  where the internal-registry is not being used.

I'm also using max_over_time with a 1-hour lookback to try to
smear over small durations of "waiting for new registry activity to
repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want
the complicated "registry in use?" detection, so they have an 'or'
entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also
adjusting label management so it's easier to see from the result which
clauses are matching for a particular cluster, in case we need to
debug why we're matching when we don't expect to, or not matching when
we do expect to.

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

  $ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
@bennerv (Contributor) commented Feb 9, 2024

LGTM on ARO:
[screenshot of the PromQL result on an ARO cluster]

@wking wking force-pushed the AzureRegistryImagePreservation-active-registry-use branch from f6eef5d to 5e4480f on February 9, 2024 19:09
@wking (Member, Author) commented Feb 9, 2024

And refreshing my 4.13.33 screenshots, now that we've added the ARO condition, I get similar results to what I got before, and they recognize that they aren't ARO clusters:

[refreshed screenshots of the PromQL results]

@sdodson (Member) commented Feb 9, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2024
openshift-ci bot commented Feb 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking


@openshift-merge-bot openshift-merge-bot bot merged commit f81d52a into openshift:master Feb 9, 2024
@wking wking deleted the AzureRegistryImagePreservation-active-registry-use branch February 9, 2024 20:09
openshift-merge-bot bot pushed a commit that referenced this pull request May 31, 2024
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are on OVN, but aren't sure if IPsec is enabled
  (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be
  exposed.  -1 will cause an evaluation failure [1].  Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects
  scraping is failing), we might be exposed.  Returning no time series
  will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know
IPsec is not enabled" case.

The max_over_time avoids hiccuping if metrics are interrupted; see
5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look
for active registry use, 2024-02-09, #4763).

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to
the other 4.14 releases with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
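A skeletal illustration of that or-chain (the real rule's matchers and the exact way it detects OVN via apiserver_storage_objects are not reproduced here; the resource value below is a placeholder):

  (
      group(ovnkube_master_ipsec_enabled{_id=""} == 1)      # IPsec enabled: exposed (1)
   or 0 * group(ovnkube_master_ipsec_enabled{_id=""} == 0)  # IPsec disabled: not exposed (0)
   or -1 * group(apiserver_storage_objects{_id="",resource="some-ovn-owned-resource"})
      # on OVN, but the IPsec metric is missing: -1 is outside the allowed 0/1
      # range, so risk evaluation fails rather than silently declaring safety
  )
  # "we know we are not OVN" would be another 0-valued clause, and if no clause
  # returns a series at all, the empty result also counts as an evaluation failure.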