
Conversation

@petr-muller
Member

petr-muller commented Oct 19, 2022

Found and fixed two causes of hotlooping on Deployments after improving the relevant logging:

  1. ensureSeccompProfile was comparing a pointer to a struct and hence always considered the two SeccompProfiles to be different, set modified to true, and triggered the update path for possibly unchanged Deployments (see the sketch after the diff below):
apps.go:43] Updating Deployment openshift-apiserver-operator/openshift-apiserver-operator with empty diff: possible hotloop after wrong comparison
  2. When the desired state of .spec.template.spec.securityContext is nil, the API server actually returns a Deployment where the relevant field is not nil but a pointer to an empty struct (semantically equivalent):
Updating Deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator due to diff:   &v1.Deployment{
        TypeMeta:   {},
...
        Spec: v1.PodSpec{
          ... // 15 identical fields
          HostIPC:               false,
          ShareProcessNamespace: nil, 
-         SecurityContext:       s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+         SecurityContext:       nil, 
          ImagePullSecrets:      nil, 
          Hostname:              "",  
          ... // 17 identical fields
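
For illustration, a minimal Go sketch of the comparison bug from item 1. This is a hypothetical reduction, not the actual ensureSeccompProfile code: mixing a pointer and a struct value (or comparing two separately allocated pointers by identity) never reports equality, so modified gets set on every sync.

```go
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Two semantically identical profiles: one behind a pointer, one a plain value.
	existing := &corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault}
	required := corev1.SeccompProfile{Type: corev1.SeccompProfileTypeRuntimeDefault}

	// Buggy comparison: a *SeccompProfile and a SeccompProfile have different
	// types, so reflect.DeepEqual is always false, every sync looks like a
	// change, and an otherwise unchanged Deployment gets updated.
	fmt.Println("buggy comparison says different:", !reflect.DeepEqual(existing, required)) // true

	// Fixed comparison: compare like with like (dereference the pointer, or
	// compare two pointers, which reflect.DeepEqual follows to the values).
	fmt.Println("fixed comparison says different:", !reflect.DeepEqual(*existing, required)) // false
}
```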

Resolves: OCPBUGS-2592

I have included some minor code cleanups done by my IDE in the tests in a separate commit; let me know if you want me to remove it.

openshift-ci-robot added the jira/severity-moderate and jira/invalid-bug labels Oct 19, 2022
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

ensureSeccompProfile was comparing a pointer to a struct and hence always considered the two SeccompProfiles to be different, set modified to true and triggered the update path for possibly unchanged Deployments.

Resolves: OCPBUGS-2592

Also, I've included several minor cleanups and a relevant logging improvement.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot added the do-not-merge/work-in-progress label Oct 19, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 19, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@petr-muller
Member Author

/test all

petr-muller force-pushed the ocpbugs-2592-deployment-hotloops branch from 1015053 to 24d0a0e on October 19, 2022 18:30
@petr-muller
Member Author

/test all

petr-muller force-pushed the ocpbugs-2592-deployment-hotloops branch from 24d0a0e to 9befb83 on October 20, 2022 12:52
@petr-muller
Member Author

/test all

@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Found and fixed two causes of hotlooping on Deployments after improving the relevant logging:

  1. ensureSeccompProfile was comparing a pointer to a struct and hence always considered the two SeccompProfiles to be different, set modified to true, and triggered the update path for possibly unchanged Deployments:
apps.go:43] Updating Deployment openshift-apiserver-operator/openshift-apiserver-operator with empty diff: possible hotloop after wrong comparison
  2. When the desired state of .spec.template.spec.securityContext is nil, the API server actually returns a Deployment where the relevant field is not nil but a pointer to an empty struct (semantically equivalent):
Updating Deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator due to diff:   &v1.Deployment{
       TypeMeta:   {},
...
       Spec: v1.PodSpec{
         ... // 15 identical fields
         HostIPC:               false,
         ShareProcessNamespace: nil, 
-         SecurityContext:       s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+         SecurityContext:       nil, 
         ImagePullSecrets:      nil, 
         Hostname:              "",  
         ... // 17 identical fields

Resolves: OCPBUGS-2592

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Found and fixed two causes of hotlooping on Deployments after improving the relevant logging:

  1. ensureSeccompProfile was comparing a pointer to a struct and hence always considered the two SeccompProfiles to be different, set modified to true, and triggered the update path for possibly unchanged Deployments:
apps.go:43] Updating Deployment openshift-apiserver-operator/openshift-apiserver-operator with empty diff: possible hotloop after wrong comparison
  2. When the desired state of .spec.template.spec.securityContext is nil, the API server actually returns a Deployment where the relevant field is not nil but a pointer to an empty struct (semantically equivalent):
Updating Deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator due to diff:   &v1.Deployment{
       TypeMeta:   {},
...
       Spec: v1.PodSpec{
         ... // 15 identical fields
         HostIPC:               false,
         ShareProcessNamespace: nil, 
-         SecurityContext:       s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+         SecurityContext:       nil, 
         ImagePullSecrets:      nil, 
         Hostname:              "",  
         ... // 17 identical fields

Resolves: OCPBUGS-2592

I have included some minor code cleanups done by my IDE in the tests in a separate commit; let me know if you want me to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

petr-muller force-pushed the ocpbugs-2592-deployment-hotloops branch from 9befb83 to e9a3f24 on October 20, 2022 16:42
@petr-muller
Member Author

/jira refresh

openshift-ci-robot added the jira/valid-bug and bugzilla/valid-bug labels Oct 20, 2022
@openshift-ci-robot
Contributor

@petr-muller: This pull request references Jira Issue OCPBUGS-2592, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @shellyyang1989


In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot removed the jira/invalid-bug label Oct 20, 2022
petr-muller marked this pull request as ready for review October 20, 2022 16:44
openshift-ci bot requested a review from shellyyang1989 October 20, 2022 16:44
openshift-ci bot removed the do-not-merge/work-in-progress label Oct 20, 2022
@petr-muller
Member Author

/test e2e-agnostic-upgrade-into-change

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Oct 24, 2022
The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.
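
For context, a simplified sketch of the stomping behavior described in this commit message; the function and the values below are hypothetical reductions, not the actual ensureEnvVar code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// stompEnvVar mimics the problematic direction of the merge: for a selected
// variable, the value already present on the cluster (existing) overwrites the
// value the caller just injected into the desired state (required).
func stompEnvVar(required *corev1.Container, existing corev1.Container, name string) {
	for _, ex := range existing.Env {
		if ex.Name != name {
			continue
		}
		for i := range required.Env {
			if required.Env[i].Name == name {
				required.Env[i].Value = ex.Value // reverts the injected value
			}
		}
	}
}

func main() {
	// Hypothetical values: an LB hostname injected earlier in the pipeline vs.
	// the in-cluster value currently set on the running Deployment.
	required := corev1.Container{Env: []corev1.EnvVar{{Name: "KUBERNETES_SERVICE_HOST", Value: "api-int.example.test"}}}
	existing := corev1.Container{Env: []corev1.EnvVar{{Name: "KUBERNETES_SERVICE_HOST", Value: "172.30.0.1"}}}

	stompEnvVar(&required, existing, "KUBERNETES_SERVICE_HOST")
	fmt.Println(required.Env[0].Value) // "172.30.0.1": the injected LB hostname is gone
}
```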
petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Oct 24, 2022
The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.
petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Oct 25, 2022
…ddress

The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.
@petr-muller
Member Author

/hold

Let me try reintroducing a missed-defaulting hotloop problem from #559 to show what the log would look like

openshift-ci bot added the do-not-merge/hold label Oct 25, 2022
@petr-muller
Member Author

@jottofar this is what is shown for a reintroduced missed-defaulting bug. No noise.

I1025 17:54:21.821291       1 apps.go:41] Updating Deployment openshift-marketplace/marketplace-operator due to diff:   &v1.Deployment{
  	TypeMeta:   {},
  	ObjectMeta: {Name: "marketplace-operator", Namespace: "openshift-marketplace", UID: "f3ff38dc-65f2-4f85-b754-e09064d8e749", ResourceVersion: "24720", ...},
  	Spec: v1.DeploymentSpec{
  		Replicas: &1,
  		Selector: &{MatchLabels: {"name": "marketplace-operator"}},
  		Template: v1.PodTemplateSpec{
  			ObjectMeta: {Labels: {"name": "marketplace-operator"}, Annotations: {"target.workload.openshift.io/management": `{"effect": "PreferredDuringScheduling"}`}},
  			Spec: v1.PodSpec{
  				Volumes:        {{Name: "marketplace-trusted-ca", VolumeSource: {ConfigMap: &{LocalObjectReference: {Name: "marketplace-trusted-ca"}, Items: {{Key: "ca-bundle.crt", Path: "tls-ca-bundle.pem"}}, DefaultMode: &420, Optional: &true}}}, {Name: "marketplace-operator-metrics", VolumeSource: {Secret: &{SecretName: "marketplace-operator-metrics", DefaultMode: &420}}}},
  				InitContainers: nil,
  				Containers: []v1.Container{
  					{
  						... // 9 identical fields
  						VolumeMounts:  {{Name: "marketplace-trusted-ca", MountPath: "/etc/pki/ca-trust/extracted/pem/"}, {Name: "marketplace-operator-metrics", MountPath: "/var/run/secrets/serving-cert"}},
  						VolumeDevices: nil,
  						LivenessProbe: &v1.Probe{
  							ProbeHandler:                  {HTTPGet: &{Path: "/healthz", Port: {IntVal: 8080}, Scheme: "HTTP"}},
  							InitialDelaySeconds:           0,
- 							TimeoutSeconds:                1,
+ 							TimeoutSeconds:                0,
- 							PeriodSeconds:                 10,
+ 							PeriodSeconds:                 0,
- 							SuccessThreshold:              1,
+ 							SuccessThreshold:              0,
- 							FailureThreshold:              3,
+ 							FailureThreshold:              0,
  							TerminationGracePeriodSeconds: nil,
  						},
  						ReadinessProbe: &v1.Probe{
  							ProbeHandler:                  {HTTPGet: &{Path: "/healthz", Port: {IntVal: 8080}, Scheme: "HTTP"}},
  							InitialDelaySeconds:           0,
- 							TimeoutSeconds:                1,
+ 							TimeoutSeconds:                0,
- 							PeriodSeconds:                 10,
+ 							PeriodSeconds:                 0,
- 							SuccessThreshold:              1,
+ 							SuccessThreshold:              0,
- 							FailureThreshold:              3,
+ 							FailureThreshold:              0,
  							TerminationGracePeriodSeconds: nil,
  						},
  						StartupProbe: nil,
  						Lifecycle:    nil,
  						... // 7 identical fields
  					},
  				},
  				EphemeralContainers: nil,
  				RestartPolicy:       "Always",
  				... // 32 identical fields
  			},
  		},
  		Strategy:        {Type: "RollingUpdate", RollingUpdate: &{MaxUnavailable: &{Type: 1, StrVal: "25%"}, MaxSurge: &{Type: 1, StrVal: "25%"}}},
  		MinReadySeconds: 0,
  		... // 3 identical fields
  	},
  	Status: {ObservedGeneration: 1, Replicas: 1, UpdatedReplicas: 1, ReadyReplicas: 1, ...},
  }

Previously we logged the diff between `existing` and `required`, but
`existing` already contains the result of the merge and therefore all
relevant bits from `required` are already identical and will never be
logged, so the logged diff is pretty much always just misleading.

Instead, save the original existing structure and compare it to the
result of the merge. Also, correctly handle the empty diff shown on
coding errors in `EnsureDeployment` when there is no relevant difference
but somehow `modified` was still set to `true`.
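
A minimal sketch of the logging flow described in this commit message, assuming a go-cmp based diff (the log output above matches go-cmp's format); the helper and the merge callback are hypothetical, not the actual EnsureDeployment code, and diffing arbitrary API objects may need additional cmp options in general.

```go
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// applyWithDiffLogging snapshots the existing object before the in-place merge
// and diffs the snapshot against the merged result, so the log shows what is
// actually being changed instead of an always-empty existing-vs-required diff.
func applyWithDiffLogging(existing, required *appsv1.Deployment, merge func(existing, required *appsv1.Deployment) bool) {
	original := existing.DeepCopy()
	if !merge(existing, required) {
		return // nothing to do, no update issued
	}
	if diff := cmp.Diff(original, existing); diff != "" {
		fmt.Printf("Updating Deployment %s/%s due to diff: %s\n", existing.Namespace, existing.Name, diff)
	} else {
		// modified was reported without an observable change: likely a broken comparison
		fmt.Printf("Updating Deployment %s/%s with empty diff: possible hotloop after wrong comparison\n", existing.Namespace, existing.Name)
	}
	// ...issue the actual update call here...
}

func main() {
	one, two := int32(1), int32(2)
	existing := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Namespace: "example-ns", Name: "example"}, Spec: appsv1.DeploymentSpec{Replicas: &one}}
	required := &appsv1.Deployment{ObjectMeta: metav1.ObjectMeta{Namespace: "example-ns", Name: "example"}, Spec: appsv1.DeploymentSpec{Replicas: &two}}

	applyWithDiffLogging(existing, required, func(existing, required *appsv1.Deployment) bool {
		if *existing.Spec.Replicas != *required.Spec.Replicas {
			existing.Spec.Replicas = required.Spec.Replicas
			return true
		}
		return false
	})
}
```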
`ensureSeccompProfile` was comparing a pointer to a struct and hence
always considered the two `SeccompProfile`s to be different, set
`modified` to true and triggered the update path for possibly unchanged
`Deployments`.

Resolves: OCPBUGS-2592
For some reason the API/client sometimes returns a pointer to an empty
`PodSecurityContext` struct instead of a `nil` pointer. When our desired
state is `nil`, this leads to a NOP update and a hotloop. If the desired
state is `nil`, do not modify the existing state if it is either `nil`
or equal to an empty struct.
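
A minimal sketch of the rule described in this commit message; the function name and the exact equality check are hypothetical, and a production helper may also need to tolerate empty-but-non-nil slices inside the struct returned by the server.

```go
package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// ensurePodSecurityContext treats a nil desired SecurityContext and an existing
// pointer to an all-zero PodSecurityContext as equivalent, so the server-side
// "empty struct instead of nil" normalization no longer looks like a change.
func ensurePodSecurityContext(existing *corev1.PodSpec, required corev1.PodSpec) (modified bool) {
	if required.SecurityContext == nil {
		if existing.SecurityContext == nil ||
			reflect.DeepEqual(existing.SecurityContext, &corev1.PodSecurityContext{}) {
			return false // semantically equal: leave the existing state alone
		}
	}
	if !reflect.DeepEqual(existing.SecurityContext, required.SecurityContext) {
		existing.SecurityContext = required.SecurityContext
		modified = true
	}
	return modified
}

func main() {
	// What the API server hands back vs. what the manifest asks for (nil).
	existing := corev1.PodSpec{SecurityContext: &corev1.PodSecurityContext{}}
	required := corev1.PodSpec{}

	fmt.Println(ensurePodSecurityContext(&existing, required)) // false: no spurious update, no hotloop
}
```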
petr-muller force-pushed the ocpbugs-2592-deployment-hotloops branch from 3ef2f73 to 60ea731 on October 26, 2022 12:24
@petr-muller
Member Author

/hold cancel
/label tide/merge-method-squash

Verified the logging behavior for omission hotloop problems (above); left out the code cleanup commit until we have team consensus for this kind of thing.

openshift-ci bot added the tide/merge-method-squash label and removed the do-not-merge/hold label Oct 26, 2022
@petr-muller
Member Author

/test unit
TestCVO_UpgradeFailedPayloadLoadWithCapsChanges

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Oct 26, 2022
…ddress

The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.
@petr-muller
Member Author

petr-muller commented Oct 26, 2022

This is a flake du jour; it should be fixed now:

ImagePullBackOff:  Back-off pulling image "registry.redhat.io/redhat/community-operator-index:v4.12"

/retest

@openshift-ci
Contributor

openshift-ci bot commented Oct 26, 2022

@petr-muller: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jottofar
Contributor

/lgtm

openshift-ci bot added the lgtm label Oct 27, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 27, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Oct 27, 2022
openshift-merge-robot merged commit c58385e into openshift:master Oct 27, 2022
@openshift-ci-robot
Contributor

@petr-muller: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2592 has been moved to the MODIFIED state.


In response to this:

Found and fixed two causes of hotlooping on Deployments after improving the relevant logging:

  1. ensureSeccompProfile was comparing a pointer to a struct and hence always considered the two SeccompProfiles to be different, set modified to true, and triggered the update path for possibly unchanged Deployments:
apps.go:43] Updating Deployment openshift-apiserver-operator/openshift-apiserver-operator with empty diff: possible hotloop after wrong comparison
  2. When the desired state of .spec.template.spec.securityContext is nil, the API server actually returns a Deployment where the relevant field is not nil but a pointer to an empty struct (semantically equivalent):
Updating Deployment openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator due to diff:   &v1.Deployment{
       TypeMeta:   {},
...
       Spec: v1.PodSpec{
         ... // 15 identical fields
         HostIPC:               false,
         ShareProcessNamespace: nil, 
-         SecurityContext:       s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+         SecurityContext:       nil, 
         ImagePullSecrets:      nil, 
         Hostname:              "",  
         ... // 17 identical fields

Resolves: OCPBUGS-2592

I have included some minor code cleanups done by my IDE in the tests in a separate commit; let me know if you want me to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

petr-muller added a commit to petr-muller/cluster-version-operator that referenced this pull request Oct 31, 2022
…ddress

The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Nov 16, 2022
…ddress

The problem was identified by Trevor and David to be a broken substitution of
the internal load balancer address into `KUBERNETES_SERVICE_HOST` (see my [JIRA comment](https://issues.redhat.com/browse/OCPBUGS-1458?focusedCommentId=21090756&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-21090756)
and the related [Slack thread](https://coreos.slack.com/archives/C011CSSPBLK/p1664925995946479?thread_ts=1661182025.992649&cid=C011CSSPBLK)).

CVO injects the LB hostname in
[`ModifyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourcebuilder/apps.go#L19)
just fine, but the Deployment is then applied in
[`ApplyDeployment`](https://github.com/openshift/cluster-version-operator/blob/dc1ad0aef5f3e1b88074448d21445a5bddb6b05b/lib/resourceapply/apps.go#L17),
where the
`EnsureDeployment`->`ensurePodTemplateSpec`->`ensurePodSpec`->`ensureContainers`->`ensureContainer`->`ensureEnvVar`
chain stomps the updated value in `required` with the old value from
`existing`, reverting the injection.

This behavior was added intentionally in openshift#559
as part of a fix for various hot-looping issues; the substitution
apparently caused some hot-looping in the past ([Slack thread](https://coreos.slack.com/archives/CEGKQ43CP/p1620934857402200?thread_ts=1620895567.367100&cid=CEGKQ43CP)).
I have thoroughly tested removing the special handling of
`KUBERNETES_SERVICE_HOST` and saw no problematic behavior. After fixing other
hot-looping problems in openshift#855
to eliminate noise, no new hot-loops occur with the
`KUBERNETES_SERVICE_HOST` handling removed.