A Node removed from the cluster at exactly the wrong moment can leave an openscap-pod stuck in Pending forever.
If a Node removal is triggered just as a new ComplianceScan starts, the scan pod will most likely never complete its scan. Even if the scan pod does get scheduled to the node, it is drained immediately as the node is deleted, so it produces no results. CO then appears to recreate the pod for the now-removed node, and that pod stays stuck in Pending forever.
On subsequent scheduled scans, the node list no longer includes the removed node, so CO never cleans up the scan pod stuck in Pending, because of how it deletes scan pods.
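To make the suspected gap concrete, here is a minimal Python model of it. It is not the operator's code: it assumes cleanup is driven by the current node list, and the `ScanPod` type, its `target_node` field, and both helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanPod:
    """Minimal stand-in for a scan pod; not the operator's real type."""
    name: str
    target_node: str  # node the pod was created for (hypothetical field)
    phase: str        # Kubernetes pod phase, e.g. "Pending" or "Succeeded"


def pods_missed_by_node_cleanup(pods, current_nodes):
    """Return pods that a cleanup loop over current_nodes would never visit."""
    known = set(current_nodes)
    return [p for p in pods if p.target_node not in known]


def stuck_pending(pods, current_nodes):
    """Return the missed pods that are also still Pending."""
    missed = pods_missed_by_node_cleanup(pods, current_nodes)
    return [p for p in missed if p.phase == "Pending"]


pods = [
    ScanPod("openscap-pod-a", "worker-1", "Succeeded"),
    ScanPod("openscap-pod-b", "worker-2", "Pending"),  # worker-2 was removed
]
leftover = stuck_pending(pods, current_nodes=["worker-1"])
print([p.name for p in leftover])  # prints ['openscap-pod-b']
```

If the operator instead listed scan pods by a scan label and compared them against the node list, a pod like `openscap-pod-b` would be detectable. Whether that fits the operator's design is an open question for the maintainers.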
The scan pods do not define an OwnerRef to the ComplianceScan object, so deleting the ComplianceScan does not clean up the pending pod either.
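For reference, a rough sketch of what an owner reference on a scan pod could look like, built as a plain Python dict. The field names are the standard Kubernetes `ownerReferences` fields; the `compliance.openshift.io/v1alpha1` API version, the example names and UID, and the helper functions are assumptions for illustration, not the operator's implementation.

```python
def owner_reference(scan_name, scan_uid):
    """Build an ownerReferences entry pointing at a ComplianceScan."""
    return {
        "apiVersion": "compliance.openshift.io/v1alpha1",  # assumed API version
        "kind": "ComplianceScan",
        "name": scan_name,
        "uid": scan_uid,  # must match the UID of the live owner object
        "controller": True,
        "blockOwnerDeletion": True,
    }


def with_owner(pod_manifest, ref):
    """Return a copy of the pod manifest with the owner reference appended."""
    metadata = dict(pod_manifest.get("metadata", {}))
    metadata["ownerReferences"] = list(metadata.get("ownerReferences", [])) + [ref]
    return {**pod_manifest, "metadata": metadata}


# Hypothetical names and UID, for illustration only.
pod = {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": "openscap-pod-b"}}
owned = with_owner(pod, owner_reference("example-scan", "00000000-0000-0000-0000-000000000000"))
print(owned["metadata"]["ownerReferences"][0]["kind"])  # prints ComplianceScan
```

With a reference like this in place, the Kubernetes garbage collector would remove the pod when the owning ComplianceScan is deleted, provided both objects live in the same namespace.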
Note that we enable timeout retries on scans and have debug enabled. Looking at the ComplianceScan handler, it may be that with debug: true set, CO skips the cleanup logic that uses the node list created at scan start time, and so misses the opportunity to clean up the stuck Pending scan pod.
OpenShift 4.12.33
compliance-operator.v0.1.61