OCPBUGS-65514: [4.20] status manager: remove managedFields for deleted zone upon zone deletion #2855
Conversation
|
@ricky-rav: This pull request references Jira Issue OCPBUGS-65514, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/payload 4.20 ci blocking |
|
@ricky-rav: trigger 5 job(s) of type blocking for the ci release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/828c5050-bfa8-11f0-8113-baac25834d05-0 trigger 13 job(s) of type blocking for the nightly release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/828c5050-bfa8-11f0-8113-baac25834d05-1 |
|
/retest-required |
|
/payload 4.20 nightly blocking |
|
@ricky-rav: trigger 13 job(s) of type blocking for the nightly release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7cd628d0-c097-11f0-8a66-44f6d631449c-0 |
|
/retest-required |
|
/payload periodic-ci-openshift-release-master-ci-4.20-e2e-aws-upgrade-ovn-single-node |
|
@ricky-rav: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info. |
|
/test periodic-ci-openshift-release-master-ci-4.20-e2e-aws-upgrade-ovn-single-node |
|
@ricky-rav: The specified target(s) for /test were not found. |
|
/payload 4.20 nightly blocking |
|
@ricky-rav: trigger 13 job(s) of type blocking for the nightly release of OCP 4.20
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b02258b0-c3a0-11f0-9d3d-46bd6fc0c28a-0 |
|
/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial-3of3 |
|
@ricky-rav: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ecfcd790-c401-11f0-9b86-481881d3dad0-0 |
EgressFirewall objects were retaining managedFields entries for nodes that had been deleted. When a node was deleted, the cleanup logic would apply an empty status object using the deleted node as the field manager. This incorrectly signaled to the API server that the manager was now managing an empty status, leaving a stale entry in managedFields.
This change corrects the cleanup logic in cleanupStatus. The manager for the deleted node now applies an EgressFirewall configuration that completely omits the status field. This correctly signals that the manager is giving up ownership of the field to the server-side apply mechanism, causing the API server to remove the manager's entry from managedFields. This prevents the buildup of stale data in etcd for large clusters with frequent node churn.
The same logic is applied to the other resource types using the status manager: ANP, APBRoute, EgressQoS, NetworkQoS, EgressService.
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit e3863d9)
Resources with a condition-based status (EgressQoS, NetworkQoS) store the zone name
in the condition Type field ("Ready-In-Zone-$zoneName"), but not in the
message field. This caused cleanup to fail because GetZoneFromStatus()
couldn't extract the zone name from the message.
Fix this by transforming the output of getMessages(): extract the zone from the condition Type and prepend it to the returned message as
"$zoneName: message", matching the format used by message-based resources (EgressFirewalls, AdminPolicyBasedExternalRoutes).
This also fixes needsUpdate(), which now properly detects zone-specific changes, since it compares messages that include the zone name.
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit d9ae873)
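A sketch of the transformation described above (helper names here are illustrative, not the actual ones in the PR): pull the zone name out of the condition's Type and prepend it to the message, yielding the same "$zoneName: message" shape the message-based resources already use.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition Type format used by condition-based resources
// (EgressQoS, NetworkQoS): "Ready-In-Zone-$zoneName".
const zoneConditionPrefix = "Ready-In-Zone-"

// zoneScopedMessage rewrites a condition's message into the
// "$zoneName: message" format, so zone extraction from the message
// works the same way for condition-based and message-based resources.
func zoneScopedMessage(conditionType, message string) string {
	zone := strings.TrimPrefix(conditionType, zoneConditionPrefix)
	return zone + ": " + message
}

// zoneFromMessage recovers the zone name from a zone-scoped message,
// mirroring what a GetZoneFromStatus-style helper can now do uniformly.
func zoneFromMessage(msg string) string {
	return strings.SplitN(msg, ":", 2)[0]
}

func main() {
	m := zoneScopedMessage("Ready-In-Zone-node1", "EgressQoS rules applied")
	fmt.Println(m)                 // node1: EgressQoS rules applied
	fmt.Println(zoneFromMessage(m)) // node1
}
```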
When zones are deleted, empty ApplyStatus patches are sent to remove status ownership. Due to a previous bug, these patches left behind stale managedFields entries with signature {"f:status":{}}.
This commit adds a one-time startup cleanup that detects and removes these stale entries by checking if managedFields have an empty status and belong to zones that no longer exist. The purpose is to distinguish managedFields that belong to deleted zones from managedFields that belong to external clients (e.g. kubectl). The cleanup runs once when the status manager starts and zones are first discovered.
Also added unit test to verify the startup cleanup logic.
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit 5597bf8)
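The startup check described above can be sketched as follows. This is a simplified illustration (the manager-naming convention and function names are assumptions, not the PR's code): an entry is a candidate for removal only when its fieldsV1 payload is exactly the empty-status signature left by the old bug and its manager maps to a zone that no longer exists; external clients such as kubectl own real fields and never match the signature.

```go
package main

import (
	"fmt"
	"strings"
)

// staleStatusManagers returns the field managers whose managedFields
// entries are leftovers from the old bug: an empty-status signature
// ({"f:status":{}}) combined with a zone that is no longer live.
// entries maps manager name -> fieldsV1 JSON; liveZones is the set of
// zones discovered at startup.
func staleStatusManagers(entries map[string]string, liveZones map[string]bool) []string {
	const emptyStatusSig = `{"f:status":{}}`
	var stale []string
	for manager, fieldsV1 := range entries {
		if fieldsV1 != emptyStatusSig {
			continue // owns real fields (e.g. kubectl): not a leftover
		}
		// Assume zone-scoped managers are named "<prefix>-<zone>";
		// this naming is illustrative.
		zone := manager[strings.LastIndex(manager, "-")+1:]
		if !liveZones[zone] {
			stale = append(stale, manager)
		}
	}
	return stale
}

func main() {
	got := staleStatusManagers(map[string]string{
		"egressfirewall-node1": `{"f:status":{}}`,   // zone gone: stale
		"egressfirewall-node2": `{"f:status":{}}`,   // zone alive: keep
		"kubectl-edit":         `{"f:metadata":{}}`, // real fields: keep
	}, map[string]bool{"node2": true})
	fmt.Println(got)
}
```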
ANP/BANP don't use a typed status manager, so add an explicit startup cleanup to remove any stale managed fields left over from previous versions.
Signed-off-by: Riccardo Ravaioli <rravaiol@redhat.com>
(cherry picked from commit 0a004f3)
Force-pushed from e33ba3d to 49b03c0 (Compare)
|
/retest-required |
|
/test e2e-gcp-ovn-techpreview |
|
/retest-required |
2 similar comments
|
/retest-required |
|
/retest-required |
|
/override ci/prow/e2e-aws-ovn-windows |
|
@jcaamano: Overrode contexts on behalf of jcaamano: ci/prow/e2e-aws-ovn-windows |
|
@ricky-rav: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jcaamano, ricky-rav. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
|
/jira refresh |
|
@huiran0826: This pull request references Jira Issue OCPBUGS-65514, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact. |
|
/verified by @huiran0826 |
|
@huiran0826: This PR has been marked as verified by @huiran0826. |
Merged commit 1b4dc2f into openshift:release-4.20
|
@ricky-rav: Jira Issue Verification Checks: Jira Issue OCPBUGS-65514 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. |
|
Fix included in accepted release 4.20.0-0.nightly-2025-11-29-002756 |