
Conversation

@nhosoi (Contributor) commented Oct 16, 2019

clusterlogging_controller.go - Adding watch for cluster proxy

Borrowed the code from cluster-network-operator/pkg/controller/proxyconfig
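
For context, a minimal sketch of what such a watch looks like with the controller-runtime APIs of that era (the package, controller name, and wiring below are illustrative, not the exact code in this PR; only the Watch on the cluster-scoped Proxy reflects the idea borrowed from cluster-network-operator):

```go
// Sketch only: register a watch on the cluster-scoped Proxy object so that
// proxy changes are fed through the reconcile loop. Identifiers here are
// illustrative stand-ins for the operator's real controller setup.
package clusterlogging

import (
	configv1 "github.com/openshift/api/config/v1"

	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func add(mgr manager.Manager, r reconcile.Reconciler) error {
	c, err := controller.New("clusterlogging-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// The cluster-wide proxy config is a cluster-scoped object named "cluster".
	return c.Watch(&source.Kind{Type: &configv1.Proxy{}}, &handler.EnqueueRequestForObject{})
}
```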

@openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 16, 2019
@nhosoi (Contributor, Author) commented Oct 16, 2019

/retest

@nhosoi (Contributor, Author) commented Oct 16, 2019

/test e2e-operator

Contributor:

should that be e.Meta ?

Contributor Author:

I tried e.Meta, then it failed as follows...
clusterlogging_controller.go:61:73: e.Meta undefined (type event.UpdateEvent has no field or method Meta)
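
For reference, in the controller-runtime version in use here, event.UpdateEvent carries MetaOld/ObjectOld and MetaNew/ObjectNew rather than a single Meta field (only Create/Delete events carry e.Meta), which is what the compiler error above points at. A hedged sketch of a predicate using those fields (the field choices and the generation comparison are illustrative):

```go
// Sketch only: UpdateEvent has no Meta field; it exposes the old and new
// object metadata separately. Create/Delete events carry a single Meta.
package clusterlogging

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

var proxyPredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		// Reconcile only when the object's generation (i.e. its spec) changed.
		return e.MetaOld.GetGeneration() != e.MetaNew.GetGeneration()
	},
	CreateFunc: func(e event.CreateEvent) bool {
		// Create events do have a single Meta/Object pair.
		return e.Meta.GetName() == "cluster"
	},
}
```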


@ewolinetz (Contributor) commented:

@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change?

@nhosoi (Contributor, Author) commented Oct 16, 2019

@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change?

Thanks for your reviews, @ewolinetz. What I could test was adding noProxy and/or httpProxy to the cluster proxy and checking whether the fluentd env vars picked them up, then removing them and checking again. httpsProxy and trustedCA are not tested (I'm not sure how to do so...).

To be honest, with or without this PR there's no difference in my test results...

@nhosoi (Contributor, Author) commented Oct 17, 2019

This patch really seems to break the e2e tests... :( But it's not clear to me how adding a watch causes these failures...

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-logging-operator/255/pull-ci-openshift-cluster-logging-operator-master-e2e-operator/944/build-log.txt
level=error msg="Cluster operator {} {} is {} with {}: {}%!(EXTRA string=dns, v1.ClusterStatusConditionType=Degraded, v1.ConditionStatus=True, string=NotAllDNSesAvailable, string=Not all desired DNS DaemonSets available)"
level=error msg="Cluster operator {} {} is {} with {}: {}%!(EXTRA string=machine-config, v1.ClusterStatusConditionType=Degraded, v1.ConditionStatus=True, string=MachineConfigDaemonFailed, string=Failed to resync 0.0.1-2019-10-16-173124 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2))"
error: could not run steps: step e2e-operator failed: template pod "e2e-operator" failed: the pod ci-op-mibc2vhb/e2e-operator failed after 1h28m56s (failed containers: setup, test): ContainerFailed one or more containers exited

@ewolinetz (Contributor) commented:

@nhosoi a watch means that the operator looks for that object to change, and when it does it sends that request through the reconcile loop.

So what may end up happening is we have multiple events that are being re-reconciled (so that we can periodically update our status).

We may need to investigate a better way to update our status periodically without holding onto events ad infinitum.
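
One common pattern for that last point (a sketch of a general controller-runtime technique, not something this PR implements) is to return a RequeueAfter from Reconcile instead of re-enqueueing the triggering event indefinitely:

```go
// Sketch only: return a RequeueAfter so controller-runtime calls Reconcile
// again after a fixed interval for periodic status refreshes, instead of
// holding onto the triggering events. The type name and the 30-second
// interval are illustrative stand-ins for the operator's real reconciler.
package clusterlogging

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ReconcileClusterLogging stands in for the operator's reconciler type.
type ReconcileClusterLogging struct{}

// periodicResult is what a successful Reconcile could return to be invoked
// again in 30 seconds without re-enqueueing the original event.
func (r *ReconcileClusterLogging) periodicResult() (reconcile.Result, error) {
	return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
```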

@nhosoi (Contributor, Author) commented Oct 17, 2019

/test e2e-operator

@nhosoi (Contributor, Author) commented Oct 17, 2019

@nhosoi a watch means that the operator looks for that object to change,
and when it does it sends that request through the reconcile loop.
So what may end up happening is we have multiple events that are being
re-reconciled (so that we can periodically update our status).
We may need to investigate a better way to update our status
periodically without holding onto events ad infinitum

I think that explains what I'm observing and wondering about... Even without the newly added watches, the cluster proxy config was consumed by the fluentd pod. Does that mean the cluster proxy event is reconciled by some other request and my addition is redundant??? I added two watches, one for the proxy configmap and the other for the proxy object itself. Let me disable them one by one and figure out what is causing this error...

@nhosoi added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2019
@nhosoi (Contributor, Author) commented Oct 17, 2019

/test e2e-operator

Contributor:

So one thing to note with doing this: I believe when we get to Reconcile, any proxy config change will push that event through, so we may fail to get the clusterlogging instance.

I'm not sure if we can do an EnqueueRequestForOwner in this case to bypass that.
But in the case where we do get a proxy event change, we don't want to requeue it at the end of a successful run.
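
One alternative to EnqueueRequestForOwner (which would not apply cleanly here, since the cluster-scoped Proxy is not owned by a ClusterLogging instance) is to map every proxy event onto the clusterlogging request so that Reconcile always receives the object it expects. A sketch, assuming the instance name and namespace that appear later in this thread ("instance" in openshift-logging); the function and variable names are illustrative:

```go
// Sketch only: translate any Proxy event into a reconcile request for the
// clusterlogging instance, so proxy changes are handled by the same loop.
package clusterlogging

import (
	"k8s.io/apimachinery/pkg/types"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func watchProxy(c controller.Controller) error {
	// Every Proxy event maps to the single clusterlogging instance request.
	mapFn := handler.ToRequestsFunc(func(a handler.MapObject) []reconcile.Request {
		return []reconcile.Request{
			{NamespacedName: types.NamespacedName{Name: "instance", Namespace: "openshift-logging"}},
		}
	})
	return c.Watch(&source.Kind{Type: &configv1.Proxy{}},
		&handler.EnqueueRequestsFromMapFunc{ToRequests: mapFn})
}
```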

Contributor:

OK - so how do other operators deal with this? It seems that other operators that need to respect proxy settings would have the same problem (unless once again logging is the pioneer that gets the arrows...)

Contributor Author:

I'm learning from cluster-network-operator. That operator has a separate reconciler for each watched target(?). In this PR, I piggybacked the proxyconfig watches onto the clusterlogging reconciler... Do you think we have to have a separate reconciler, as cluster-network-operator does???
https://github.com/openshift/cluster-network-operator/blob/master/pkg/controller/add_networkconfig.go
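
For comparison, the registration pattern in the linked add_networkconfig.go boils down to an operator-sdk style add_ file. A hedged sketch of the equivalent proxyconfig registration (the proxyconfig import path and its Add function are assumptions based on that example, not the exact code in this PR):

```go
// Sketch only: operator-sdk style registration of a separate proxyconfig
// controller, mirroring cluster-network-operator's add_networkconfig.go.
package controller

import (
	"sigs.k8s.io/controller-runtime/pkg/manager"

	"github.com/openshift/cluster-logging-operator/pkg/controller/proxyconfig"
)

// AddToManagerFuncs is normally declared once in the scaffolded controller.go;
// it is repeated here so the sketch is self-contained.
var AddToManagerFuncs []func(manager.Manager) error

func init() {
	// Register the proxyconfig controller alongside the clusterlogging one.
	AddToManagerFuncs = append(AddToManagerFuncs, proxyconfig.Add)
}
```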

Contributor Author:

And by commenting out the cluster configmap watch, the e2e test passed.

@ewolinetz (Contributor) commented Oct 18, 2019:

cluster configmap watch

just to clarify and prevent confusion, this isn't a configmap -- it's a non-namespaced object of type config

Contributor:

Do you think we have to have a separate reconciler as cluster-network-operator does?

I think this is the path we want to take as well, yes.

Contributor Author:

Thanks! I'm trying it, and my first cut causes a strange problem. :) If I update the cluster proxy spec/status, it restarts all the pods, including elasticsearch and kibana... Obviously, there is a lot more to learn... But I'm glad I have something to pursue.

@openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 23, 2019
@nhosoi force-pushed the log-464 branch 2 times, most recently from 0da021f to b5bbf8c on October 23, 2019 00:41
@nhosoi (Contributor, Author) commented Oct 23, 2019

Hi @ewolinetz, I'm stuck... Could you please help me?

Regarding your advice [0], I tried everything I could think of (some attempts are left in the patch, commented out...), but it looks like all of my attempts were invalid, and updating the cluster proxy still affects all pods managed by the cluster logging operator. Could you please give me some hints on "updating the proxy reconciler to just update the collector work"?

[0]

you can update your proxy reconciler to just update the collector work... likely something is being updated (maybe a deployment changed? the operator logs should state what happened)

What I'm observing (the following is a snippet of the debug prints [1] I put into the patch [2]) is that Reconcile for 'instance' is called about every 30 seconds and updates the status of each pod. Of course, that does not restart the pods. But when I update the cluster proxy's spec (in my testing, the noProxy value), all the pods are restarted. It looks to me like this comes from reconciling 'instance', since there is no k8shandler.Reconcile call in the Reconcile of proxyconfig_controller.go...

And one more thing I'm confused about: even without this PR/attempt (adding a Watch for the cluster proxy), changes made to the cluster proxy status are applied to the fluentd EnvVar. That is, it looks to me like we don't need the Watch for the cluster proxy and this PR introduces something redundant. But I must be wrong...

[1]
..............
time="2019-10-23T18:24:59Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:01Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:20Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:20Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:39Z" level=info msg="DBG_PX: Proxyconfig reconcile request.Name: 'cluster'"
time="2019-10-23T18:25:39Z" level=info msg="DBG_PX: Proxyconfig reconcile request.Name: 'cluster'"
time="2019-10-23T18:25:50Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:21Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:21Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:55Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
..............

[2]
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-7e30450910148948e030df8d7b99f153R121
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R132

Contributor:

i think we should leave this function definition the same and instead have a separate call for the proxy config reconciler to just adjust the collector

Contributor:

this feels very hacky

@nhosoi force-pushed the log-464 branch 2 times, most recently from 850178c to b39f488 on October 31, 2019 00:45
@nhosoi removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2019
@nhosoi (Contributor, Author) commented Oct 31, 2019

/test e2e-operator

@nhosoi force-pushed the log-464 branch 4 times, most recently from ea0c6e2 to b1b9d28 on November 18, 2019 23:15
@nhosoi (Contributor, Author) commented Nov 18, 2019

/retest

@nhosoi force-pushed the log-464 branch 3 times, most recently from 74b4eeb to 0dbeaa3 on November 19, 2019 22:53
@ewolinetz (Contributor) commented:

@bparees were your requested changes addressed?
otherwise lgtm

Contributor:

It looks like this is trying to determine whether the reconcile request was triggered by a change to a configmap that logging cares about, but I'm not sure how this works?

Contributor Author:

The above code is "Outdated".
Now we are checking whether request.Name is in ReconcileForGlobalProxyList, which is {"fluentd-trusted-ca-bundle", "kibana-trusted-ca-bundle"}:
} else if utils.ContainsString(constants.ReconcileForGlobalProxyList, request.Name) {
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R108

Contributor:

is this watching all configmaps in the entire cluster? does operatorSDK not give us a way to scope the watch to the logging namespace?

@nhosoi (Contributor, Author) commented Nov 21, 2019:

I assume your question is about the ConfigMap at line 66 (not the Proxy). Now a ConfigMap is watched only if it is in the openshift-logging namespace and its name is fluentd-trusted-ca-bundle or kibana-trusted-ca-bundle.
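
A hedged sketch of what such a scoped watch can look like as a controller-runtime predicate (the namespace and configmap names come from this thread; the predicate and helper names are illustrative):

```go
// Sketch only: restrict the ConfigMap watch to the two trusted CA bundle
// configmaps in the openshift-logging namespace.
package clusterlogging

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func isWatchedCABundle(namespace, name string) bool {
	if namespace != "openshift-logging" {
		return false
	}
	return name == "fluentd-trusted-ca-bundle" || name == "kibana-trusted-ca-bundle"
}

var trustedCABundlePredicate = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		return isWatchedCABundle(e.Meta.GetNamespace(), e.Meta.GetName())
	},
	UpdateFunc: func(e event.UpdateEvent) bool {
		return isWatchedCABundle(e.MetaNew.GetNamespace(), e.MetaNew.GetName())
	},
	DeleteFunc: func(e event.DeleteEvent) bool {
		return isWatchedCABundle(e.Meta.GetNamespace(), e.Meta.GetName())
	},
}
```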

@ewolinetz self-requested a review November 20, 2019 21:35
@nhosoi force-pushed the log-464 branch 2 times, most recently from 364df02 to b411583 on November 21, 2019 01:08
@nhosoi (Contributor, Author) commented Nov 21, 2019

@ewolinetz, @bparees, thank you very much for your reviews.
I squashed the patches. Hopefully, they fix the issues you pointed out...

@nhosoi (Contributor, Author) commented Nov 21, 2019

/test e2e-operator

@richm (Contributor) commented Nov 21, 2019

Please review. We need to merge this by tomorrow, or it won't happen for a couple of weeks. Let's try to get this merged ASAP.

@bparees (Contributor) commented Nov 21, 2019

@richm @nhosoi @ewolinetz my concerns around the event handling and deployment rollout triggering have been addressed.. lgtm.

@nhosoi (Contributor, Author) commented Nov 21, 2019

after https://github.com/openshift/cluster-logging-operator/pull/255/files#r348753401 i'll put a flag on this... @nhosoi

Thanks, @ewolinetz, @bparees!!
I love this auto-commit feature in GitHub! But I'd still better squash into one PR...

- Adding proxyconfig controller to watch cluster proxy and trusted CA
  bundle configmaps in the openshift-logging namespace. These configmaps
  are named KibanaTrustedCAName and FluentdTrustedCAName.
- Adding pkg/constants/constants.go to share the constant strings.
- Simplifying how proxy environment variables are set in EnvVar.
- Adding trusted CA bundle configmap support.
  The configmap is being watched in the proxyconfig controller.
  The fluentd daemonset and the kibana deployment hold a hash of the CA
  certs in an annotation.  The value is updated when the CA certs in the
  configmap are updated, which restarts the fluentd and kibana pods and
  updates the mounted tls-ca-bundle.pem file.  The volume mount overrides
  /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem with the certs
  auto-filled into the configmap.

utils.EnvVarEqual:
- In EnvVarSourceEqual, replacing reflect.DeepEqual with customized
  EnvVarResourceFieldSelectorEqual since Divisor (type resource.Quantity)
  is not correctly compared by DeepEqual.

Others:
hack/common - Keeping debug_print for future debugging.

This PR fixes the following 3 bugs.
Bug 1752725 - Log into kibana console get `504 Gateway Time-out The
              server didn't respond in time. ` when http_proxy enabled
Bug 1766187 - Authentication "500 Internal Error"
Bug 1768762 - Fluentd: "Could not communicate to Elasticsearch" when
              http proxy enabled in the cluster.
              Fix: Setting the elasticsearch FQDN to logStoreService
              and elasticsearchName.  The FQDN belongs to the global
              proxy noProxy list. By doing so, it skips the global
              proxy to communicate with the internal elasticsearch.
Bug 1774837 - Too many `warning: The environment variable HTTP_PROXY is
              discouraged. Use http_proxy.` in fluentd pod logs after
              enable forwarding logs to user-managed ES as insecure
              Fix: In addition to HTTP_PROXY, HTTPS_PROXY and NO_PROXY,
              setting http_proxy, https_proxy and no_proxy, as well.
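
To make the restart mechanism described in the commit message concrete, here is a minimal sketch of the hash-annotation technique (the annotation key, helper names, and the ca-bundle.crt data key are assumptions, not the exact code in this PR):

```go
// Sketch only: hash the trusted CA bundle and stamp it onto the pod template
// as an annotation; when the bundle changes, the hash changes, the template
// changes, and the daemonset/deployment rolls out new pods.
package k8shandler

import (
	"crypto/sha256"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// trustedCAHashAnnotation is an illustrative annotation key.
const trustedCAHashAnnotation = "logging.openshift.io/trusted-ca-bundle-hash"

// calcTrustedCAHash returns a stable hash of the CA bundle stored in the
// trusted CA configmap (key name assumed to be ca-bundle.crt).
func calcTrustedCAHash(cm *corev1.ConfigMap) string {
	sum := sha256.Sum256([]byte(cm.Data["ca-bundle.crt"]))
	return fmt.Sprintf("%x", sum)
}

// setTrustedCAHash records the hash on a pod template so that a changed
// bundle produces a changed template and therefore a rollout.
func setTrustedCAHash(template *corev1.PodTemplateSpec, hash string) {
	if template.Annotations == nil {
		template.Annotations = map[string]string{}
	}
	template.Annotations[trustedCAHashAnnotation] = hash
}
```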
@ewolinetz (Contributor) commented:

/lgtm

@openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2019
@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ewolinetz, nhosoi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 6a61d90 into openshift:master Nov 21, 2019
@openshift-ci-robot commented:

@nhosoi: All pull requests linked via external trackers have merged. Bugzilla bug 1752725 has been moved to the MODIFIED state.


In response to this:

Bug 1752725: Log into kibana console get 504 Gateway Time-out The server didn't respond in time. when http_proxy enabled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pmoogi-redhat pushed a commit to pmoogi-redhat/cluster-logging-operator that referenced this pull request Apr 26, 2022
Bug 1752725: Log into kibana console get `504 Gateway Time-out The server didn't respond in time. ` when http_proxy enabled

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
