
Conversation

@gabemontero (Contributor) commented Mar 11, 2020

Per a recent discussion between @dmage, @adambkaplan, and myself: the image signature controller in the OCM reaches out to registries like registry.redhat.io without using the global proxy when one is set.

see https://github.com/openshift/cluster-version-operator/blob/611490ff448962c70ab29a90764a6c30d4ee87dc/lib/resourcebuilder/apps.go#L55

the bug originator manually set the proxy env vars on the OCM and saw things work

@bparees @mfojtik @soltysh FYI

@openshift-ci-robot (Contributor)

@gabemontero: This pull request references Bugzilla bug 1805168, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1805168: add inject-proxy env annotation to daemonset so CVO injects proxy env…

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 11, 2020
@gabemontero (Author)

existing e2e's are green at least

/assign @bparees
/assign @adambkaplan

@bparees (Contributor) commented Mar 12, 2020

I don't think this is the right solution.

the CVO injection is only for operators. As in "I need the CVO to inject these env vars into my operators' deployment because the CVO owns my deployment and only it can touch it".

You (the OCM operator) own the OCM's deployment; you should be setting the proxy envs on the OCM's deployment and reconciling them as they change (i.e. the OCM operator needs to watch the proxy config and propagate that config to the OCM deployment)
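The reconciliation described here can be sketched as a pure function from observed proxy status to the env vars the operator should stamp onto the operand's pod template. This is a minimal sketch, not the operator's actual code: `ProxyStatus` and `EnvVar` below are simplified stand-ins for the `configv1.ProxyStatus` and `corev1.EnvVar` API types the real controller would use.

```go
package main

import "fmt"

// EnvVar and ProxyStatus are simplified stand-ins for corev1.EnvVar
// and configv1.ProxyStatus; the real operator uses the API types.
type EnvVar struct{ Name, Value string }
type ProxyStatus struct{ HTTPProxy, HTTPSProxy, NoProxy string }

// proxyEnvVars computes the env vars the operator should set on the
// operand's pod template from the observed global proxy status; empty
// fields produce no env var.
func proxyEnvVars(p ProxyStatus) []EnvVar {
	var envs []EnvVar
	if p.HTTPProxy != "" {
		envs = append(envs, EnvVar{"HTTP_PROXY", p.HTTPProxy})
	}
	if p.HTTPSProxy != "" {
		envs = append(envs, EnvVar{"HTTPS_PROXY", p.HTTPSProxy})
	}
	if p.NoProxy != "" {
		envs = append(envs, EnvVar{"NO_PROXY", p.NoProxy})
	}
	return envs
}

func main() {
	envs := proxyEnvVars(ProxyStatus{HTTPSProxy: "https://proxy.example.com:3128", NoProxy: ".cluster.local"})
	for _, e := range envs {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The operator would recompute this on every proxy-config event and roll the operand out when the result differs from what is currently on the daemonset.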

@bparees (Contributor) commented Mar 12, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2020
@gabemontero (Author)

> I don't think this is the right solution.
>
> the CVO injection is only for operators. As in "I need the CVO to inject these env vars into my operators' deployment because the CVO owns my deployment and only it can touch it".
>
> You (the OCM operator) own the OCM's deployment, you should be setting the proxy envs on the OCM's deployment and reconciling them as they change (i.e. the OCM operator needs to watch the proxy config and propagate that config to the OCM deployment)

OK, a bigger change, but certainly not onerous. @adambkaplan's suggestion of the annotation was trivial enough that at least proposing it to a bigger audience for evaluation seemed fine.

There is also the possibility of the OCM, which is already watching proxy config, setting the envs on itself. While less expensive, that probably subverts our "what an operator does and what an operand does" conventions. But certainly voice any opinions there @bparees if you like.

@bparees (Contributor) commented Mar 12, 2020

> There is also the possibility of the OCM, which is already watching proxy config, setting the envs on itself. While less expensive, that probably subverts our "what an operator does and what an operand does" conventions. But certainly voice any opinions there @bparees if you like.

the resource needs to be owned/written to by a single writer. for the OCM daemonset resource, that's the OCM operator. Otherwise you risk two things fighting w/ each other about what the correct state of the resource should be.

@gabemontero (Author)

> There is also the possibility of the OCM, which is already watching proxy config, setting the envs on itself. While less expensive, that probably subverts our "what an operator does and what an operand does" conventions. But certainly voice any opinions there @bparees if you like.
>
> the resource needs to be owned/written to by a single writer. for the OCM daemonset resource, that's the OCM operator. Otherwise you risk two things fighting w/ each other about what the correct state of the resource should be.

Well I was actually thinking it would be a golang os.Setenv() call in the OCM process, and would not be manipulating the DaemonSet at all. It would be dynamically updating the envs the DaemonSet added to the underlying Pod.

But again, probably too off the beaten path ... I just mentioned it cause it popped into my head

I'll start down the path in this repo

@bparees (Contributor) commented Mar 12, 2020

> Well I was actually thinking it would be a golang os.Setenv() call in the OCM process, and would not be manipulating the DaemonSet at all. It would be dynamically updating the envs the DaemonSet added to the underlying Pod.

the drawbacks w/ that approach are:

  1. lack of debuggability (you can't just look at the resource to understand what configuration it is using)
  2. I believe the transports+static libs get setup early and read this information just once, so changing the env var will not, in many cases, affect the configuration they are using, so you'd need to restart the process to pick up the change anyway, which means the OCM would have to nuke itself, and then ensure(on restart) the envs got set before any lib+transport initialization

(2) is why we had to make the CVO able to inject proxy config on operator resources, because operators could not solve the problem themselves by dynamically setting their own env vars. You're basically in the same boat, one level deeper.

@bparees (Contributor) commented Mar 12, 2020

Note: i think i may have seen another devex PR recently that was adding the CVO proxy injection annotation. I didn't look at it closely... but if it was doing the same thing as this one (adding the injection annotation to an operand resource, not an operator resource) it should be reverted and redone per the approach i've outlined in the discussion here.

@gabemontero (Author)

> Note: i think i may have seen another devex PR recently that was adding the CVO proxy injection annotation. I didn't look at it closely... but if it was doing the same thing as this one (adding the injection annotation to an operand resource, not an operator resource) it should be reverted and redone per the approach i've outlined in the discussion here.

since @adambkaplan suggested the annotation to me, maybe he knows which PR you are referring to @bparees

otherwise, I'll refactor this PR to have OCM-O watch global proxy and update the OCM daemonset's env as you previously mentioned in the classic operator/operand mgmt pattern

@bparees (Contributor) commented Mar 12, 2020

btw if the operand has to consume the proxy it also needs to consume the proxy CAs, so there's more that needs to be done here.

the operand has to mount a configmap that's annotated for the proxy CA injection, and the operator has to watch that configmap and redeploy the operand if the configmap is updated w/ new CA content.

@gabemontero (Author)

> btw if the operand has to consume the proxy it also needs to consume the proxy CAs, so there's more that needs to be done here.
>
> the operand has to mount a configmap that's annotated for the proxy CA injection, and the operator has to watch that configmap and redeploy the operand if the configmap is updated w/ new CA content.

yep true as well ... so yeah the OCM-O is already creating a proxy injected CM for the CAs and is watching it as part of the build support ... the daemonset update would have to include mounting that CA in addition to the envs

And I suppose the OCM would have to be updated to copy the cert from the k8s mount point to the well known location under /etc just like the openshift/builder image does today

@bparees (Contributor) commented Mar 12, 2020

> And I suppose the OCM would have to be updated to copy the cert from the k8s mount point to the well known location under /etc just like the openshift/builder image does today

yes, or mount the CM there directly. (The CM should contain all the CAs you need, including system trusts, so it should be ok to just mount over top of the image's own CAs)
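The direct-mount approach can be sketched as follows. This is a hedged sketch, not the PR's actual change: `VolumeMount` is a simplified stand-in for the `corev1` type, the volume name is a hypothetical example, and the paths assume the OpenShift convention that the injected trust bundle lives under the `ca-bundle.crt` key of the ConfigMap and that the image is RHEL-based (system bundle at `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`).

```go
package main

import "fmt"

// VolumeMount is a simplified stand-in for corev1.VolumeMount.
type VolumeMount struct {
	Name      string
	MountPath string
	SubPath   string
	ReadOnly  bool
}

// trustedCAMount returns a mount that overlays the injected CA bundle
// (the "ca-bundle.crt" key of the trust-bundle ConfigMap) on top of
// the image's own system trust bundle, so no copy step is needed.
func trustedCAMount(volumeName string) VolumeMount {
	return VolumeMount{
		Name:      volumeName,
		MountPath: "/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem",
		SubPath:   "ca-bundle.crt",
		ReadOnly:  true,
	}
}

func main() {
	m := trustedCAMount("trusted-ca") // hypothetical volume name
	fmt.Printf("%s -> %s (key %s)\n", m.Name, m.MountPath, m.SubPath)
}
```

Because the injected bundle already includes the system trusts, mounting over the image's CAs is safe, which is the point being made here.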

@adambkaplan (Contributor)

> yes, or mount the CM there directly. (The CM should contain all the CAs you need, including system trusts, so it should be ok to just mount over top of the image's own CAs)

I concur that mounting directly is the way to go for OCM. Builds are kind of a special case because we have additionalTrustedCAs that are used only for pulling/pushing images. Same deal with the image registry.

@soltysh commented Mar 16, 2020

I think you need something similar to openshift/cluster-kube-controller-manager-operator#285 and openshift/cluster-kube-controller-manager-operator#325 to be able to consume and use the proxy as far as I can tell.

@gabemontero (Author)

> I think you need something similar to openshift/cluster-kube-controller-manager-operator#285 and openshift/cluster-kube-controller-manager-operator#325 to be able to consume and use the proxy as far as I can tell.

thanks for the pointers @soltysh

I've got some changes on my laptop that share some commonality with those PRs

I'm wrapping up some unit tests and should have a PR up before EOB today.

@gabemontero gabemontero changed the title Bug 1805168: add inject-proxy env annotation to daemonset so CVO injects proxy env… Bug 1805168: inject global proxy envs,ca into OCM daemonset Mar 16, 2020
@openshift-ci-robot (Contributor)

@gabemontero: This pull request references Bugzilla bug 1805168, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1805168: inject global proxy envs,ca into OCM daemonset


@gabemontero (Author)

/hold cancel

PR switched over to a proxy watch, config observation, and injection of envs/ca into OCM daemonset via operator

ptal

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 16, 2020
@soltysh left a comment:

This lgtm

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2020
@adambkaplan (Contributor)

@gabemontero looks good to me, squash commits and we can make this official.

@gabemontero (Author)

10-4 @adambkaplan @soltysh thanks ..

@gabemontero (Author)

... and commits squashed/pushed @adambkaplan @soltysh

and thanks @bparees for all the help

@gabemontero (Author)

error: could not run steps: step e2e-aws failed: failed to acquire lease: status 503 Service Unavailable, status code 503

error: could not run steps: step e2e-aws-operator failed: failed to acquire lease: status 503 Service Unavailable, status code 503

error: could not run steps: step e2e-aws-upgrade failed: failed to acquire lease: status 503 Service Unavailable, status code 503

/retest

@gabemontero (Author)

green tests @adambkaplan @soltysh @bparees

can somebody lgtm ?

configobservation.Listers{
ImageConfigLister: configInformers.Config().V1().Images().Lister(),
BuildConfigLister: configInformers.Config().V1().Builds().Lister(),
ProxyLister: configInformers.Config().V1().Proxies().Lister(),
@sttts (Contributor) commented on these lines:

shouldn't we install event handlers to notice changes and update?

@gabemontero (Author) replied:

you are absolutely right @sttts

though this particular piece is a leftover from the confusion over when to use observe_config_controller.go vs. operator.go and its calls to the sync_*.go file

so I think we need
a) cleanup and full revert of the changes here in observe_config_controller.go
b) and install the event handlers as you say around here:

proxyLister proxyvclient1.ProxyLister,
kubeInformersForOpenshiftControllerManager informers.SharedInformerFactory,
operatorConfigClient operatorclientv1.OperatorV1Interface,
kubeClient kubernetes.Interface,
recorder events.Recorder,
) *OpenShiftControllerManagerOperator {
c := &OpenShiftControllerManagerOperator{
targetImagePullSpec: targetImagePullSpec,
operatorConfigClient: operatorConfigClient,
proxyLister: proxyLister,
kubeClient: kubeClient,
queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "KubeApiserverOperator"),
rateLimiter: flowcontrol.NewTokenBucketRateLimiter(0.05 /*3 per minute*/, 4),
recorder: recorder,
}
operatorConfigInformer.Informer().AddEventHandler(c.eventHandler())
kubeInformersForOpenshiftControllerManager.Core().V1().ConfigMaps().Informer().AddEventHandler(c.eventHandler())
kubeInformersForOpenshiftControllerManager.Core().V1().ServiceAccounts().Informer().AddEventHandler(c.eventHandler())
kubeInformersForOpenshiftControllerManager.Core().V1().Services().Informer().AddEventHandler(c.eventHandler())
kubeInformersForOpenshiftControllerManager.Apps().V1().Deployments().Informer().AddEventHandler(c.eventHandler())

thanks ... getting started

return required, false, err
}
}
if err == nil {
Contributor review comment:

else {

}
if httpsProxySet && !hasHTTPSProxy {
newEnvs = append(newEnvs, corev1.EnvVar{Name: "HTTPS_PROXY", Value: proxyCfg.Status.HTTPSProxy})
}
@sttts (Contributor) commented:

How about this?

newEnvs := []corev1.EnvVar{}
for _, env := range c.Env {
	name := strings.TrimSpace(env.Name)
	switch name {
	case "HTTPS_PROXY":
		if len(proxyCfg.Status.HTTPSProxy) == 0 {
			continue
		}
		env.Value = proxyCfg.Status.HTTPSProxy
	case ...:
	case ....:
	}
	newEnvs = append(newEnvs, env)
}
forceRollout = forceRollout || !reflect.DeepEqual(newEnvs, c.Env)

@gabemontero (Author) replied:

I had to tweak it slightly to deal with an empty c.Env but I agree your form is better ... will be pushing new commit soon
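Folding in that tweak, the merge could look something like the sketch below. This is a hedged reconstruction, not the code that landed: `EnvVar` and `ProxyStatus` are simplified stand-ins for the API types, and initializing `newEnvs` as a nil slice avoids a reflect.DeepEqual false positive when both the container env and the proxy status are empty.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Simplified stand-ins for corev1.EnvVar and configv1.ProxyStatus.
type EnvVar struct{ Name, Value string }
type ProxyStatus struct{ HTTPProxy, HTTPSProxy, NoProxy string }

// mergeProxyEnv reconciles a container's env list against the observed
// proxy status: stale proxy vars are updated or dropped, missing ones
// appended. It returns the new list and whether a rollout is needed.
func mergeProxyEnv(cur []EnvVar, p ProxyStatus) ([]EnvVar, bool) {
	want := map[string]string{
		"HTTP_PROXY":  p.HTTPProxy,
		"HTTPS_PROXY": p.HTTPSProxy,
		"NO_PROXY":    p.NoProxy,
	}
	// Keep newEnvs nil when empty so DeepEqual(nil, nil) holds and an
	// empty-vs-empty comparison does not force a spurious rollout.
	var newEnvs []EnvVar
	seen := map[string]bool{}
	for _, env := range cur {
		name := strings.TrimSpace(env.Name)
		if val, ok := want[name]; ok {
			seen[name] = true
			if val == "" {
				continue // proxy cleared: drop the var
			}
			env.Value = val
		}
		newEnvs = append(newEnvs, env)
	}
	// Handle the empty/missing c.Env case: append vars not yet present.
	for _, name := range []string{"HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"} {
		if !seen[name] && want[name] != "" {
			newEnvs = append(newEnvs, EnvVar{name, want[name]})
		}
	}
	return newEnvs, !reflect.DeepEqual(newEnvs, cur)
}

func main() {
	envs, changed := mergeProxyEnv(nil, ProxyStatus{HTTPSProxy: "https://p:3128"})
	fmt.Println(envs, changed)
}
```

The nil-slice detail matters later in this thread: a `[]EnvVar{}` literal compared against a nil `c.Env` is exactly the kind of DeepEqual false positive that forced a follow-up fix.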

@gabemontero (Author)

updates pushed @sttts thanks again

@gabemontero (Author)

broad failures across all suites waiting on expected SAs in test namespaces... as an example:

Mar 26 16:37:05.309: INFO: Creating project "e2e-test-s2i-build-quota-btfrh"
Mar 26 16:37:05.725: INFO: Waiting on permissions in project "e2e-test-s2i-build-quota-btfrh" ...
Mar 26 16:37:05.752: INFO: Waiting for ServiceAccount "default" to be provisioned...
Mar 26 16:39:03.083: INFO: Waiting for ServiceAccount "deployer" to be provisioned...
[AfterEach] [sig-builds][Feature:Builds] s2i build with a quota

The CVO has a nasty looking extra condition on it:

                   {
                        "lastTransitionTime": "2020-03-26T15:39:58Z",
                        "message": "Unable to retrieve available updates: currently installed version 0.0.1-2020-03-26-153053 not found in the \"stable-4.5\" channel",
                        "reason": "VersionNotFound",
                        "status": "False",
                        "type": "RetrievedUpdates"
                    }

punting

/test e2e-aws

@gabemontero (Author)

I see a lot of throttling notifications trying to access the api server in the OCM-O pod logs around the times of the e2e-aws-operator failures ... my assumption is that this messed with those e2e's, but am getting up to speed on those tests, looking around the artifacts some more before retesting that one.

@gabemontero (Author)

> I see a lot of throttling notifications trying to access the api server in the OCM-O pod logs around the times of the e2e-aws-operator failures ... my assumption is that this messed with those e2e's, but am getting up to speed on those tests, looking around the artifacts some more before retesting that one.

Yeah even the basic, very first test that just reads the OCM clusteroperator suffered a timeout, but if you look at the artifacts, that object showed everything was well, and was last updated before the test started.

/test e2e-aws-operator

@gabemontero (Author)

/retest

1 similar comment
@gabemontero (Author)

/retest

@gabemontero (Author)

/test e2e-aws

@gabemontero (Author)

pushed fix for e2e-aws-operator that should better account for a false positive we were getting with the reflect.DeepEqual change from the last iteration

@gabemontero (Author)

CI hiccup across most of the board

/retest

@gabemontero (Author)

green tests

bump - any of the previous reviewers ready to LGTM ?

@adambkaplan (Contributor) left a comment:

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adambkaplan, gabemontero, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [adambkaplan,soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 5f3cb00 into openshift:master Mar 31, 2020
@openshift-ci-robot (Contributor)

@gabemontero: All pull requests linked via external trackers have merged: openshift/cluster-openshift-controller-manager-operator#145. Bugzilla bug 1805168 has been moved to the MODIFIED state.


In response to this:

Bug 1805168: inject global proxy envs,ca into OCM daemonset


@gabemontero gabemontero deleted the ocm-proxy-envs branch March 31, 2020 18:15