IR-461: blocked-edges/4.14.*-AzureRegistryImagePreservation: Look for active registry use #4763
Conversation
@wking: This pull request references IR-461, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.16.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed 251d5ae to f6eef5d (Compare).
Testing the new PromQL against 4.13.33's azure-ovn-upgrade-4.13-micro run in PromeCIeus, it notices some registry use late in the update, after the nodes rolled.

And testing against 4.13.33's aws-sdn-upgrade-4.13-micro run in PromeCIeus, it sees both "no registry activity" and "detected registry activity", but knows the cluster is not exposed in either case, because it is AWS and not Azure.
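For anyone re-running that comparison by hand, the two counters can be inspected directly in PromeCIeus or any other Prometheus UI. These are two separate queries to run one at a time, a simplified fragment rather than the rule's full PromQL:

    # Non-empty result: the registry handled image activity (e.g.
    # BlobStore.Create) at some point in the last hour.
    max_over_time(imageregistry_request_duration_seconds_count[1h])

    # Non-empty result here while the query above is empty: registry metrics
    # are being collected, but the registry is idle.
    max_over_time(imageregistry_http_request_duration_seconds_count[1h])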
blocked-edges/4.14.*-AzureRegistryImagePreservation: Look for active registry use

Some clusters do not exercise the internal image registry, and we do not want to bother those admins about this risk. Look at imageregistry_request_duration_seconds_count (which only records registry activity like BlobStore.Create) as a sign that the registry is actively used to handle images. Compare with imageregistry_http_request_duration_seconds_count, which the registry serves even when it is not actively being used to manage images, as a sign that registry metrics are working, to distinguish between "we know the registry is not being used" and "registry metrics collection is broken, and we're blind to registry use".

In 4.14, the registry became an optional component [1], but it is not optional in 4.13. And it's 4.13 clusters evaluating this PromQL, because of the 'from: 4[.]13[.].*'. So we don't have to worry about "what if imageregistry_http_request_duration_seconds_count is not available?" cases outside of the "registry metrics collection is broken" case.

The registry does not persist these *_count values across container restarts, so there is a window of exposure like:

1. Registry activity increments imageregistry_request_duration_seconds_count counters.
2. PromQL correctly identifies "this cluster is using its registry".
3. A registry container restarts, zeroing its *_count buckets.
4. If _all_ registry containers restart before there is new registry activity, the imageregistry_request_duration_seconds_count signal may go away.
5. New registry activity repopulates imageregistry_request_duration_seconds_count; return to step 1.

The chances that all registry pods restart at the same time seem low, outside of:

* Updates or other registry-hosting-node reboots, where clusters that actively use the in-cluster registry are likely to have pods consuming images from that registry that need re-pulling, which would drive fresh registry traffic and repopulate imageregistry_request_duration_seconds_count.
* Registry reconfiguration, which would not drive fresh registry traffic, but which seems like unlikely admin activity in a cluster where the internal registry is not being used.

I'm also using max_over_time with a 1-hour lookback to try to smear over small durations of "waiting for new registry activity to repopulate imageregistry_request_duration_seconds_count".

ARO is comfortable waiting for the bugfix to ship, and does not want the complicated "registry in use?" detection, so they have an 'or' entry in that stanza to ensure it will always match on ARO clusters.

And now that the PromQL is getting pretty complicated, I'm also adjusting label management so it's easier to see from the result which clauses are matching for a particular cluster, in case we need to debug why we're matching when we don't expect to, or not matching when we do expect to.

I'm also adding _id="" placeholders for HyperShift; see 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, openshift#3591) for more on that.

Generated by editing 4.14.1's file by hand and copying it out to the others with:

    $ ls blocked-edges/*AzureRegistryImagePreservation.yaml | grep -v '/4[.]14[.]1-A' | while read X; do TO="$(grep '^to: ' "${X}")"; sed "s/^to: .*/${TO}/" blocked-edges/4.14.1-AzureRegistryImagePreservation.yaml > "${X}"; done

[1]: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-optional-capabilities
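For orientation, here is a minimal sketch of the shape such a blocked-edges file could take. The to/from/url/name/message/matchingRules layout follows the targeted-update-edge-blocking enhancement; the message text, the cluster_infrastructure_provider platform check, and the exact clause structure are illustrative assumptions, and the ARO always-match entry and the extra label bookkeeping from this PR are omitted:

    to: 4.14.1
    from: 4[.]13[.].*
    url: https://issues.redhat.com/browse/IR-461
    name: AzureRegistryImagePreservation
    message: |-
      Hypothetical summary: on Azure, the in-cluster image registry may fail
      to preserve image blobs across this update.
    matchingRules:
    - type: PromQL
      promql:
        promql: |
          (
            # Exposed (1): the cluster is on Azure and the registry recorded
            # image activity (e.g. BlobStore.Create) within the last hour.
            group(max_over_time(imageregistry_request_duration_seconds_count{_id=""}[1h]))
            and
            group(cluster_infrastructure_provider{_id="",type="Azure"})
          )
          or
          # Not exposed (0): registry metrics are flowing, but either there
          # was no image activity or the platform is not Azure.
          0 * group(max_over_time(imageregistry_http_request_duration_seconds_count{_id=""}[1h]))
          # Neither metric present: no series are returned, the evaluation
          # fails, and the risk cannot be ruled out.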
Force-pushed f6eef5d to 5e4480f (Compare).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
The PromQL is:

* If we know IPsec is enabled, we're exposed, or...
* If we know IPsec is not enabled, we're not exposed, or...
* If we are OVN, but aren't sure whether IPsec is enabled (e.g. ovnkube_master_ipsec_enabled scraping is failing), we might be exposed. -1 will cause an evaluation failure [1]. Or...
* If we know we are not OVN, we are not exposed, or...
* If we aren't sure we're on OVN (e.g. apiserver_storage_objects scraping is failing), we might be exposed. Returning no time series will cause an evaluation failure.

The label_replace makes it easier to tell when we're in the "we know IPsec is not enabled" case. The max_over_time avoids hiccuping if metrics are interrupted; see 5e4480f (blocked-edges/4.14.*-AzureRegistryImagePreservation: Look for active registry use, 2024-02-09, #4763).

I'm also adding _id="" to the queries as a pattern to support HyperShift and other systems that could query the cluster's data out of a PromQL engine that stores data for multiple clusters. More context in 5cb2e93 (blocked-edges/4.11.*-KeepalivedMulticastSkew: Explicit _id="", 2023-05-09, #3591).

Generated by creating the 4.14.0 file by hand and copying it out to the other 4.14 releases with:

    $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.14&arch=amd64' | jq -r '.nodes[] | .version' | grep '^4[.]14[.]' | grep -v '^4[.]14[.]0$' | while read VERSION; do sed "s/4.14.0/${VERSION}/" blocked-edges/4.14.0-OVNInterConnectTransitionIPsec.yaml > "blocked-edges/${VERSION}-OVNInterConnectTransitionIPsec.yaml"; done

[1]: https://github.com/openshift/enhancements/blob/4668a0825c59739dfafd2ae661c16cf30f540946/enhancements/update/targeted-update-edge-blocking.md?plain=1#L119
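As a rough illustration of that clause ordering, here is a hedged sketch. It is not the rule that was merged: the OVN detection via apiserver_storage_objects{resource="egressips.k8s.ovn.org"}, the configmaps baseline used to prove that storage metrics are being scraped, and the reason label name are all assumptions made for the example:

    (
      # 1. IPsec known enabled: exposed (1).
      group(max_over_time(ovnkube_master_ipsec_enabled{_id=""}[1h]) == 1)
    )
    or
    label_replace(
      # 2. IPsec known disabled: not exposed (0), tagged so the matching clause
      #    is visible when inspecting query results.
      0 * group(max_over_time(ovnkube_master_ipsec_enabled{_id=""}[1h]) == 0),
      "reason", "ipsec-disabled", "__name__", ".*"
    )
    or
    (
      # 3. OVN (the EgressIP CRD is registered) but the IPsec metric is absent:
      #    -1 forces an evaluation failure, so the risk cannot be ruled out.
      -1 * (
        group(apiserver_storage_objects{_id="",resource="egressips.k8s.ovn.org"})
        unless
        group(ovnkube_master_ipsec_enabled{_id=""})
      )
    )
    or
    (
      # 4. Storage metrics are being scraped (configmaps always exist) but the
      #    OVN CRD is absent: we know we are not OVN, so not exposed (0).
      0 * (
        group(apiserver_storage_objects{_id="",resource="configmaps"})
        unless
        group(apiserver_storage_objects{_id="",resource="egressips.k8s.ovn.org"})
      )
    )
    # 5. If apiserver_storage_objects is not being scraped at all, nothing
    #    above matches, the query returns no series, and evaluation fails.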





