
Conversation

@wking (Member) commented Feb 9, 2024

Some clusters do not exercise the internal image registry, and we do not want to bother those admins about this risk. Look at imageregistry_request_duration_seconds_count (which only records registry activity like BlobStore.Create) as a sign that the registry is actively being used to handle images. Compare with imageregistry_http_request_duration_seconds_count, which the registry serves even when it is not actively being used to manage images, as a sign that registry metrics are working. That lets us distinguish between "we know the registry is not being used" and "registry metrics collection is broken, and we're blind to registry use".
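Roughly, that distinction looks like the sketch below (not the shipped matching rule; the group() aggregation and the 0-multiplication are just one way to express the idea):

  # 1 when the registry has recorded image activity (BlobStore.Create and friends)...
    group(imageregistry_request_duration_seconds_count)
  # ...0 when registry metrics are flowing but show no image activity...
  or 0 * group(imageregistry_http_request_duration_seconds_count)
  # ...and no series at all when registry metrics are missing, i.e. we're blind to registry use.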

In 4.14, the registry became an optional component, but it is not optional in 4.13. And it's 4.13 clusters that evaluate this PromQL, because of the 'from: 4[.]13[.].*' matcher. So we don't have to worry about "what if imageregistry_http_request_duration_seconds_count is not available?" cases outside of the "registry metrics collection is broken" case.

The registry does not persist these *_count values across container restarts, so there is a window of exposure like:

  1. Registry activity increments imageregistry_request_duration_seconds_count counters.
  2. PromQL correctly identifies "this cluster is using its registry".
  3. Registry container restarts, zeroing its *_count buckets.
  4. If all registry containers restart before there is new registry activity, the imageregistry_request_duration_seconds_count signal may go away.
  5. New registry activity repopulates imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low, outside of:

  • Updates or other registry-hosting-node reboots, where clusters that actively use the in-cluster registry are likely to have pods consuming images from that registry that need re-pulling, which would drive fresh registry traffic and repopulate imageregistry_request_duration_seconds_count.
  • Registry reconfiguration, which would not drive fresh registry traffic, but which seems like unlikely admin activity in a cluster where the internal registry is not being used.

I'm also using max_over_time with a 1-hour lookback to smear over small durations of "waiting for new registry activity to repopulate imageregistry_request_duration_seconds_count".
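For example, the smoothing amounts to something like the expression below (a sketch; the bare metric selector without additional matchers is illustrative):

  # Take the highest value seen over the past hour, so a brief scrape gap or a
  # just-restarted (zeroed) counter does not immediately drop the "in use" signal.
  max_over_time(imageregistry_request_duration_seconds_count[1h])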

And now that the PromQL is getting pretty complicated, I'm also adjusting label management so it's easier to see from the result which clauses are matching for a particular cluster, in case we need to debug why we're matching when we don't expect to, or not matching when we do expect to.
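One way to do that is to stamp each clause's output with a synthetic label via label_replace; the "reason" label name and value below are made up for illustration, not necessarily what the shipped rule uses:

  # The ".*" regex matches any (or missing) source-label value, so the
  # replacement always applies, and the clause that matched shows up in the
  # result as reason="registry_in_use".
  label_replace(
    group(max_over_time(imageregistry_request_duration_seconds_count[1h])),
    "reason", "registry_in_use", "__name__", ".*"
  )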

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93 (#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

$ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 9, 2024
@openshift-ci-robot commented Feb 9, 2024

@wking: This pull request references IR-461 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.

In response to this: (the PR description quoted above)

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2024
@wking wking force-pushed the AzureRegistryImagePreservation-active-registry-use branch from 251d5ae to f6eef5d on February 9, 2024 18:24
@wking (Member, Author) commented Feb 9, 2024

Testing the new PromQL in 4.13.33's azure-ovn-upgrade-4.13-micro run in PromeCIeus, it notices some registry use late in the update after the nodes rolled:

[screenshots of the PromQL results in PromeCIeus]

@wking (Member, Author) commented Feb 9, 2024

And testing against 4.13.33's aws-sdn-upgrade-4.13-micro run in PromeCIeus, it sees both "no registry activity" and "detected registry activity", but knows it is not exposed in either case because it is AWS and not Azure:

[screenshot of the PromQL result]

…registry use

Some clusters do not exercise the internal image registry, and we do
not want to bother those admins about this risk.  Look at
imageregistry_request_duration_seconds_count (which only records
registry activity like BlobStore.Create) as a sign that the registry
is actively used to handle images.  Compare with
imageregistry_http_request_duration_seconds_count, which the registry
serves even when it is not actively being used to manage images, as a
sign that registry metrics are working, to distinguish between "we
know the registry is not being used" and "registry metrics collection
is broken, and we're blind to registry use".

In 4.14, the registry component became an optional component [1], but
it is not optional in 4.13.  And it's 4.13 clusters evaluating this
PromQL because of the 'from: 4[.]13[.].*'.  So we don't have to worry
about "what if imageregistry_http_request_duration_seconds_count is
not available?" cases outside of the "registry metrics collection is
broken" case.

The registry does not persist these *_count values across container
restarts, so there is a window of exposure like:

1. Registry activity increments
   imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. Registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry
   activity, the imageregistry_request_duration_seconds_count signal
   may go away.
5. New registry activity repopulates
   imageregistry_request_duration_seconds_count, return to step 1.

The chances that all registry pods restart at the same time seem low,
outside of:

* Updates or other registry-hosting-node reboots, where clusters that
  actively use the in-cluster registry are likely to have pods
  consuming images from that registry that need re-pulling, which would
  drive fresh registry traffic and repopulate
  imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry
  traffic, but which seems like unlikely admin activity in a cluster
  where the internal-registry is not being used.

I'm also using max_over_time with a 1-hour lookback to try to
smear over small durations of "waiting for new registry activity to
repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want
the complicated "registry in use?" detection, so they have an 'or'
entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also
adjusting label management so it's easier to see from the result which
clauses are matching for a particular cluster, in case we need to
debug why we're matching when we don't expect to, or not matching when
we do expect to.

I'm also adding _id="" placeholders for HyperShift, see 5cb2e93
(blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="",
2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's by hand and copying out to the others with:

  $ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
@bennerv (Contributor) commented Feb 9, 2024

LGTM on ARO:
[screenshot of the PromQL result on an ARO cluster]

@wking wking force-pushed the AzureRegistryImagePreservation-active-registry-use branch from f6eef5d to 5e4480f on February 9, 2024 19:09
@wking (Member, Author) commented Feb 9, 2024

And refreshing my 4.13.33 screenshots, now that we've added the ARO condition, I get similar results to what I got before, and they recognize that they aren't ARO clusters:

[refreshed screenshots of the PromQL results]

@sdodson (Member) commented Feb 9, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2024
openshift-ci bot commented Feb 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking


@openshift-merge-bot openshift-merge-bot bot merged commit f81d52a into openshift:master Feb 9, 2024
@wking wking deleted the AzureRegistryImagePreservation-active-registry-use branch February 9, 2024 20:09
openshift-merge-bot bot pushed a commit that referenced this pull request May 31, 2024
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are on OVN, but aren't sure if IPsec is enabled
  (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be
  exposed.  -1 will cause an evaluation failure [1].  Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects
  scraping is failing), we might be exposed.  Returning no time series
  will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know
IPsec is not enabled" case.

The max_over_time avoids hiccuping if metrics are interrupted; see
5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look
for active registry use, 2024-02-09, #4763).

I'm also adding _id="" to the queries as a pattern to support
HyperShift and other systems that could query the cluster's data out
of a PromQL engine that stored data for multiple clusters.  More
context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew:
Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to
the other 4.14 releases with:

  $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
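A skeletal illustration of that or-chain (the real rule's matchers and the exact way it detects OVN via apiserver_storage_objects are not reproduced here; the resource value below is a placeholder):

  (
      group(ovnkube_master_ipsec_enabled{_id=""} == 1)      # IPsec enabled: exposed (1)
   or 0 * group(ovnkube_master_ipsec_enabled{_id=""} == 0)  # IPsec disabled: not exposed (0)
   or -1 * group(apiserver_storage_objects{_id="",resource="some-ovn-owned-resource"})
      # on OVN, but the IPsec metric is missing: -1 is outside the allowed 0/1
      # range, so risk evaluation fails rather than silently declaring safety
  )
  # "we know we are not OVN" would be another 0-valued clause, and if no clause
  # returns a series at all, the empty result also counts as an evaluation failure.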