OCPBUGS-2592: Fix Deployment hotloops after incorrect SeccompProfile comparison #855
Conversation
@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Skipping CI for Draft Pull Request.
/test all
1015053 to 24d0a0e
/test all
24d0a0e to 9befb83
/test all
@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is invalid.
9befb83 to e9a3f24
/jira refresh
@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is valid. The bug has been moved to the POST state. 3 validations were run on this bug. Requesting review from QA contact.
/test e2e-agnostic-upgrade-into-change
The problem was identified to be a broken substitution of the internal load balancer into `KUBERNETES_SERVICE_HOST` by Trevor and David (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756) and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)). CVO injects the LB hostname in [`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19) fine, but then the deployment gets applied in [`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17), and the `EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar` chain stomps the updated value in `required` with the old value from `existing`, reverting the injection. This behavior was added intentionally in openshift#559 as part of a fix for various hot-looping issues; the substitution apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)). I have tested removing the special handling of `KUBERNETES_SERVICE_HOST` thoroughly and saw no problematic behavior. After fixing other hot-looping problems in openshift#855 to eliminate noise, no new hot-loops occur with the `KUBERNETES_SERVICE_HOST` handling removed.
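For illustration, a minimal, self-contained sketch with stand-in types and a hypothetical helper name (not the actual CVO code) of how an "existing wins" env-var merge reverts an injected value; the hostname and service IP below are made-up examples:

```go
// An "existing wins" merge of a single env var: if the merge copies the cluster
// value back into the desired object, the hostname injected into `required`
// beforehand is silently reverted.
package main

import "fmt"

// EnvVar is a tiny stand-in for corev1.EnvVar.
type EnvVar struct {
	Name  string
	Value string
}

// ensureEnvVarSketch mimics the problematic direction of the merge: the value
// already on the cluster overwrites the freshly injected required value.
func ensureEnvVarSketch(required *EnvVar, existing EnvVar) {
	if existing.Name == required.Name && existing.Value != "" {
		required.Value = existing.Value // stomps the injected LB hostname
	}
}

func main() {
	// Hypothetical values: the hostname injected by ModifyDeployment versus the
	// service IP currently set on the cluster.
	required := EnvVar{Name: "KUBERNETES_SERVICE_HOST", Value: "api-int.example.test"}
	existing := EnvVar{Name: "KUBERNETES_SERVICE_HOST", Value: "172.30.0.1"}

	ensureEnvVarSketch(&required, existing)
	fmt.Println(required.Value) // prints "172.30.0.1" – the injection was reverted
}
```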
/hold Let me try reintroducing a missed defaulting hotloop problem from #559 to show what the log would show
@jottofar this is what gets shown for a reintroduced missed defaulting bug. No noise.
I1025 17:54:21.821291 1 apps.go:41] Updating Deployment openshift-marketplace/marketplace-operator due to diff: &v1.Deployment{
TypeMeta: {},
ObjectMeta: {Name: "marketplace-operator", Namespace: "openshift-marketplace", UID: "f3ff38dc-65f2-4f85-b754-e09064d8e749", ResourceVersion: "24720", ...},
Spec: v1.DeploymentSpec{
Replicas: &1,
Selector: &{MatchLabels: {"name": "marketplace-operator"}},
Template: v1.PodTemplateSpec{
ObjectMeta: {Labels: {"name": "marketplace-operator"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}},
Spec: v1.PodSpec{
Volumes: {{Name: "marketplace-trusted-ca", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "marketplace-trusted-ca"}, Items: {{Key: "ca-bundle.crt", Path: "tls-ca-bundle.pem"}}, DefaultMode: &420, Optional: &true}}}, {Name: "marketplace-operator-metrics", VolumeSource: {Secret: &{SecretName: "marketplace-operator-metrics", DefaultMode: &420}}}},
InitContainers: nil,
Containers: []v1.Container{
{
... // 9 identical fields
VolumeMounts: {{Name: "marketplace-trusted-ca", MountPath: "/etc/pki/ca-trust/extracted/pem/"}, {Name: "marketplace-operator-metrics", MountPath: "/var/run/secrets/serving-cert"}},
VolumeDevices: nil,
LivenessProbe: &v1.Probe{
ProbeHandler: {HTTPGet: &{Path: "/healthz", Port: {IntVal: 8080}, Scheme: "HTTP"}},
InitialDelaySeconds: 0,
- TimeoutSeconds: 1,
+ TimeoutSeconds: 0,
- PeriodSeconds: 10,
+ PeriodSeconds: 0,
- SuccessThreshold: 1,
+ SuccessThreshold: 0,
- FailureThreshold: 3,
+ FailureThreshold: 0,
TerminationGracePeriodSeconds: nil,
},
ReadinessProbe: &v1.Probe{
ProbeHandler: {HTTPGet: &{Path: "/healthz", Port: {IntVal: 8080}, Scheme: "HTTP"}},
InitialDelaySeconds: 0,
- TimeoutSeconds: 1,
+ TimeoutSeconds: 0,
- PeriodSeconds: 10,
+ PeriodSeconds: 0,
- SuccessThreshold: 1,
+ SuccessThreshold: 0,
- FailureThreshold: 3,
+ FailureThreshold: 0,
TerminationGracePeriodSeconds: nil,
},
StartupProbe: nil,
Lifecycle: nil,
... // 7 identical fields
},
},
EphemeralContainers: nil,
RestartPolicy: "Always",
... // 32 identical fields
},
},
Strategy: {Type: "RollingUpdate", RollingUpdate: &{MaxUnavailable: &{Type: 1, StrVal: "25%"}, MaxSurge: &{Type: 1, StrVal: "25%"}}},
MinReadySeconds: 0,
... // 3 identical fields
},
Status: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},
}
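For context, a minimal, self-contained sketch with stand-in types (not the CVO code) of the "missed defaulting" class of hotloop reproduced in the log above: the manifest leaves probe fields at zero, the API server defaults them, and a naive comparison then reports a difference on every sync.

```go
package main

import "fmt"

// Probe is a tiny stand-in for corev1.Probe with only the defaulted fields.
type Probe struct {
	TimeoutSeconds   int32
	PeriodSeconds    int32
	SuccessThreshold int32
	FailureThreshold int32
}

// defaultProbe fills in the API-server defaults for zero-valued fields so the
// comparison only reacts to real differences.
func defaultProbe(p *Probe) {
	if p.TimeoutSeconds == 0 {
		p.TimeoutSeconds = 1
	}
	if p.PeriodSeconds == 0 {
		p.PeriodSeconds = 10
	}
	if p.SuccessThreshold == 0 {
		p.SuccessThreshold = 1
	}
	if p.FailureThreshold == 0 {
		p.FailureThreshold = 3
	}
}

func main() {
	required := Probe{} // manifest omits the fields, so they are zero
	existing := Probe{TimeoutSeconds: 1, PeriodSeconds: 10, SuccessThreshold: 1, FailureThreshold: 3}

	fmt.Println("naive comparison differs:", required != existing) // true – would update every sync

	defaultProbe(&required)
	fmt.Println("after defaulting differs:", required != existing) // false – no spurious update
}
```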
Previously we logged the diff between `existing` and `required`, but `existing` already contains the result of the merge, so all relevant bits from `required` are already identical and never show up; the logged diff was therefore almost always misleading. Instead, save the original existing structure and compare it to the result of the merge. Also, correctly handle the empty diff shown on coding errors in `EnsureDeployment` where there is no relevant difference but `modified` was still set to `true`.
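A minimal, self-contained sketch of the logging change described in this commit message, using stand-in types and a hypothetical helper name rather than the actual CVO code, and assuming the go-cmp module (which produces diffs like the logs above) is available:

```go
// Snapshot the object as read from the cluster, run the in-place merge, and
// diff the snapshot against the merged result. Diffing `existing` against
// `required` after the merge would be empty, because `existing` has already
// absorbed `required`.
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
)

// Deployment is a tiny stand-in for appsv1.Deployment.
type Deployment struct {
	Replicas int32
	Image    string
}

// ensureDeploymentSketch merges required into existing in place and reports
// whether anything actually changed.
func ensureDeploymentSketch(existing, required *Deployment) bool {
	modified := false
	if existing.Image != required.Image {
		existing.Image = required.Image
		modified = true
	}
	if existing.Replicas != required.Replicas {
		existing.Replicas = required.Replicas
		modified = true
	}
	return modified
}

func main() {
	existing := &Deployment{Replicas: 1, Image: "registry.example/operator:old"}
	required := &Deployment{Replicas: 1, Image: "registry.example/operator:new"}

	original := *existing // snapshot before the in-place merge
	if modified := ensureDeploymentSketch(existing, required); modified {
		// This diff shows exactly what will be sent to the API server.
		fmt.Println(cmp.Diff(original, *existing))
	}
}
```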
`ensureSeccompProfile` was comparing a pointer to a struct and hence always considered the two `SeccompProfile`s to be different, set `modified` to `true`, and triggered the update path for possibly unchanged `Deployment`s.
Resolves: OCPBUGS-2592
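For illustration, a minimal, self-contained sketch with a stand-in type (not the actual `ensureSeccompProfile` helper) of why the pointer comparison always reports a difference:

```go
package main

import (
	"fmt"
	"reflect"
)

// SeccompProfile is a tiny stand-in for corev1.SeccompProfile.
type SeccompProfile struct {
	Type string
}

func main() {
	existing := &SeccompProfile{Type: "RuntimeDefault"}
	required := &SeccompProfile{Type: "RuntimeDefault"}

	// Buggy check: pointer identity. Separately constructed objects never share
	// an address, so this is always false and `modified` was always set.
	fmt.Println("pointers equal:", existing == required) // false

	// Fixed check: compare the pointed-to values instead.
	fmt.Println("values equal:", reflect.DeepEqual(existing, required)) // true
}
```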
For some reason the API/client sometimes returns a pointer to an empty `PodSecurityContext` struct instead of a `nil` pointer. When our desired state is `nil`, this leads to a no-op update and a hotloop. If the desired state is `nil`, do not modify the existing state if it is either `nil` or equal to an empty struct.
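A minimal, self-contained sketch with a stand-in type and a hypothetical helper name of the nil-vs-empty handling this commit describes: when the desired security context is `nil` and the cluster returns either `nil` or a pointer to an all-zero struct, report no modification instead of "fixing" the Deployment forever.

```go
package main

import (
	"fmt"
	"reflect"
)

// PodSecurityContext is a tiny stand-in for corev1.PodSecurityContext.
type PodSecurityContext struct {
	RunAsNonRoot *bool
}

// ensurePodSecurityContextSketch merges required into existing and returns
// true only when a real change is needed.
func ensurePodSecurityContextSketch(existing **PodSecurityContext, required *PodSecurityContext) bool {
	if required == nil {
		// nil and a pointer to an empty struct are semantically equivalent.
		if *existing == nil || reflect.DeepEqual(**existing, PodSecurityContext{}) {
			return false
		}
		*existing = nil
		return true
	}
	if !reflect.DeepEqual(*existing, required) {
		*existing = required
		return true
	}
	return false
}

func main() {
	existing := &PodSecurityContext{} // the API server returned an empty struct, not nil
	fmt.Println(ensurePodSecurityContextSketch(&existing, nil)) // false – no spurious update
}
```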
3ef2f73 to 60ea731
/hold cancel Verified the logging behavior for omission hotloop problems (above); left out the code cleanup commit until we have team consensus for this kind of thing.
/test unit
This is a flake du jour, should be fixed now: /retest
@petr-muller: all tests passed!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jottofar, petr-muller
@petr-muller: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-2592 has been moved to the MODIFIED state.
Found and fixed two causes of hotlooping on Deployments after improving the relevant logging:
- `ensureSeccompProfile` was comparing a pointer to a struct and hence always considered the two `SeccompProfile`s to be different, set `modified` to `true`, and triggered the update path for possibly unchanged `Deployment`s.
- When the desired `.spec.template.spec.securityContext` is nil, the API server actually returns a `Deployment` where the relevant field is not `nil` but a pointer to an empty struct (semantically equivalent):

Updating Deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator due to diff: &v1.Deployment{
TypeMeta: {},
...
Spec: v1.PodSpec{
... // 15 identical fields
HostIPC: false,
ShareProcessNamespace: nil,
- SecurityContext: s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+ SecurityContext: nil,
ImagePullSecrets: nil,
Hostname: "",
... // 17 identical fields

Resolves: OCPBUGS-2592
I have included some minor code cleanups done by my IDE in the tests in a separate commit; let me know if you want me to remove it.