
Conversation

@nhosoi (Contributor) commented Oct 16, 2019

clusterlogging_controller.go - Adding watch for cluster proxy

Borrowed the code from cluster-network-operator/pkg/controller/proxyconfig
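
For context, a minimal sketch of what such a watch looks like with the controller-runtime APIs of that era (the package, controller name, and wiring below are illustrative, not the exact code in this PR; only the Watch on the cluster-scoped Proxy reflects the idea borrowed from cluster-network-operator):

```go
// Sketch only: register a watch on the cluster-scoped Proxy object so that
// proxy changes are fed through the reconcile loop. Identifiers here are
// illustrative stand-ins for the operator's real controller setup.
package clusterlogging

import (
	configv1 "github.com/openshift/api/config/v1"

	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func add(mgr manager.Manager, r reconcile.Reconciler) error {
	c, err := controller.New("clusterlogging-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// The cluster-wide proxy config is a cluster-scoped object named "cluster".
	return c.Watch(&source.Kind{Type: &configv1.Proxy{}}, &handler.EnqueueRequestForObject{})
}
```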

@openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 16, 2019
@nhosoi (Contributor, Author) commented Oct 16, 2019

/retest

@nhosoi (Contributor, Author) commented Oct 16, 2019

/test e2e-operator

Contributor:

should that be e.Meta ?

Contributor Author:

I tried e.Meta, then it failed as follows...
clusterlogging_controller.go:61:73: e.Meta undefined (type event.UpdateEvent has no field or method Meta)
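
For reference, in the controller-runtime version in use here, event.UpdateEvent carries MetaOld/ObjectOld and MetaNew/ObjectNew rather than a single Meta field (only Create/Delete events carry e.Meta), which is what the compiler error above points at. A hedged sketch of a predicate using those fields (the field choices and the generation comparison are illustrative):

```go
// Sketch only: UpdateEvent has no Meta field; it exposes the old and new
// object metadata separately. Create/Delete events carry a single Meta.
package clusterlogging

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

var proxyPredicate = predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		// Reconcile only when the object's generation (i.e. its spec) changed.
		return e.MetaOld.GetGeneration() != e.MetaNew.GetGeneration()
	},
	CreateFunc: func(e event.CreateEvent) bool {
		// Create events do have a single Meta/Object pair.
		return e.Meta.GetName() == "cluster"
	},
}
```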


@ewolinetz (Contributor) commented:

@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change?

@nhosoi (Contributor, Author) commented Oct 16, 2019

@nhosoi have you tested what the operator's processing looks like when you stack a proxy config change and a clusterlogging change?

Thanks for your reviews, @ewolinetz. What I could test was adding noProxy and/or httpProxy to the cluster proxy and checking whether the fluentd env vars picked them up, then removing them and checking again. httpsProxy and trustedCA are not tested (I'm not sure how to do so...).

To be honest, with or without this PR there's no difference in my test results...

@nhosoi (Contributor, Author) commented Oct 17, 2019

This patch really seems to break the e2e tests... :( But it's not clear to me how adding a watch causes these failures...

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-logging-operator/255/pull-ci-openshift-cluster-logging-operator-master-e2e-operator/944/build-log.txt
level=error msg="Cluster operator {} {} is {} with {}: {}%!(EXTRA string=dns, v1.ClusterStatusConditionType=Degraded, v1.ConditionStatus=True, string=NotAllDNSesAvailable, string=Not all desired DNS DaemonSets available)"
level=error msg="Cluster operator {} {} is {} with {}: {}%!(EXTRA string=machine-config, v1.ClusterStatusConditionType=Degraded, v1.ConditionStatus=True, string=MachineConfigDaemonFailed, string=Failed to resync 0.0.1-2019-10-16-173124 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 4, unavailable: 2))"
error: could not run steps: step e2e-operator failed: template pod "e2e-operator" failed: the pod ci-op-mibc2vhb/e2e-operator failed after 1h28m56s (failed containers: setup, test): ContainerFailed one or more containers exited

@ewolinetz (Contributor) commented:

@nhosoi a watch means that the operator looks for that object to change, and when it does it sends that request through the reconcile loop.

So what may end up happening is we have multiple events that are being re-reconciled (so that we can periodically update our status).

We may need to investigate a better way to update our status periodically without holding onto events ad infinitum.
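
One common pattern for that last point (a sketch of a general controller-runtime technique, not something this PR implements) is to return a RequeueAfter from Reconcile instead of re-enqueueing the triggering event indefinitely:

```go
// Sketch only: return a RequeueAfter so controller-runtime calls Reconcile
// again after a fixed interval for periodic status refreshes, instead of
// holding onto the triggering events. The type name and the 30-second
// interval are illustrative stand-ins for the operator's real reconciler.
package clusterlogging

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ReconcileClusterLogging stands in for the operator's reconciler type.
type ReconcileClusterLogging struct{}

// periodicResult is what a successful Reconcile could return to be invoked
// again in 30 seconds without re-enqueueing the original event.
func (r *ReconcileClusterLogging) periodicResult() (reconcile.Result, error) {
	return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
```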

@nhosoi (Contributor, Author) commented Oct 17, 2019

/test e2e-operator

@nhosoi (Contributor, Author) commented Oct 17, 2019

@nhosoi a watch means that the operator looks for that object to change,
and when it does it sends that request through the reconcile loop.
So what may end up happening is we have multiple events that are being
re-reconciled (so that we can periodically update our status).
We may need to investigate a better way to update our status
periodically without holding onto events ad infinitum

I think that explains what I'm observing and wondering about... Even without the newly added watches, the cluster proxy config was consumed by the fluentd pod. Does that mean the cluster proxy event is reconciled by some other request and my addition is redundant??? I added two watches, one for the proxy configmap and the other for the proxy object itself. Let me disable them one by one and figure out what is causing this error...

@nhosoi added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2019
@nhosoi (Contributor, Author) commented Oct 17, 2019

/test e2e-operator

Contributor:

So one thing to note with doing this: I believe when we get to Reconcile, any proxy config change will push that event through, so we may fail to get the clusterlogging instance.

I'm not sure if we can do an EnqueueRequestForOwner in this case to bypass that.
But in the case where we do get a proxy event change, we don't want to requeue it at the end of a successful run.
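
One alternative to EnqueueRequestForOwner (which would not apply cleanly here, since the cluster-scoped Proxy is not owned by a ClusterLogging instance) is to map every proxy event onto the clusterlogging request so that Reconcile always receives the object it expects. A sketch, assuming the instance name and namespace that appear later in this thread ("instance" in openshift-logging); the function and variable names are illustrative:

```go
// Sketch only: translate any Proxy event into a reconcile request for the
// clusterlogging instance, so proxy changes are handled by the same loop.
package clusterlogging

import (
	"k8s.io/apimachinery/pkg/types"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func watchProxy(c controller.Controller) error {
	// Every Proxy event maps to the single clusterlogging instance request.
	mapFn := handler.ToRequestsFunc(func(a handler.MapObject) []reconcile.Request {
		return []reconcile.Request{
			{NamespacedName: types.NamespacedName{Name: "instance", Namespace: "openshift-logging"}},
		}
	})
	return c.Watch(&source.Kind{Type: &configv1.Proxy{}},
		&handler.EnqueueRequestsFromMapFunc{ToRequests: mapFn})
}
```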

Contributor:

OK - so how do other operators deal with this? It seems that other operators that need to respect proxy settings would have the same problem (unless once again logging is the pioneer that gets the arrows...)

Contributor Author:

I'm learning from cluster-network-operator. That operator has a separate reconciler for each watched target(?). In this PR, I piggybacked the proxyconfig watches onto the clusterlogging reconciler... Do you think we have to have a separate reconciler, as cluster-network-operator does???
https://github.com/openshift/cluster-network-operator/blob/master/pkg/controller/add_networkconfig.go
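
For comparison, the registration pattern in the linked add_networkconfig.go boils down to an operator-sdk style add_ file. A hedged sketch of the equivalent proxyconfig registration (the proxyconfig import path and its Add function are assumptions based on that example, not the exact code in this PR):

```go
// Sketch only: operator-sdk style registration of a separate proxyconfig
// controller, mirroring cluster-network-operator's add_networkconfig.go.
package controller

import (
	"sigs.k8s.io/controller-runtime/pkg/manager"

	"github.com/openshift/cluster-logging-operator/pkg/controller/proxyconfig"
)

// AddToManagerFuncs is normally declared once in the scaffolded controller.go;
// it is repeated here so the sketch is self-contained.
var AddToManagerFuncs []func(manager.Manager) error

func init() {
	// Register the proxyconfig controller alongside the clusterlogging one.
	AddToManagerFuncs = append(AddToManagerFuncs, proxyconfig.Add)
}
```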

Contributor Author:

And by commenting out the cluster configmap watch, the e2e test passed.

@ewolinetz (Contributor) commented Oct 18, 2019:

cluster configmap watch

just to clarify and prevent confusion, this isn't a configmap -- it's a non-namespaced object of type config

Contributor:

Do you think we have to have a separate reconciler as cluster-network-operator does?

I think this is the path we want to take as well, yes.

Contributor Author:

Thanks! I'm trying it, and my first cut causes a strange problem. :) If I update the cluster proxy spec/status, it restarts all the pods, including elasticsearch and kibana... Obviously, there is a lot more to learn... But I'm glad I have something to pursue.

@openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 23, 2019
@nhosoi force-pushed the log-464 branch 2 times, most recently from 0da021f to b5bbf8c on October 23, 2019 00:41
@nhosoi (Contributor, Author) commented Oct 23, 2019

Hi @ewolinetz, I'm stuck... Could you please help me?

Regarding your advice [0], I tried everything I could think of (some attempts are left in the patch, commented out...), but it looks like all of my attempts were invalid, and updating the cluster proxy still affects all pods managed by the cluster logging operator. Could you please give me some hints on "updating the proxy reconciler to just update the collector work"?

[0]

you can update your proxy reconciler to just update the collector work... likely something is being updated (maybe a deployment changed? the operator logs should state what happened)

What I'm observing (the following is a snippet of the debug prints [1] I put into the patch [2]) is that Reconcile for 'instance' is called about every 30 seconds and updates the status of each pod. Of course, that does not restart the pods. But when I update the cluster proxy's spec (in my testing, the noProxy value), all the pods are restarted. It looks to me like this comes from reconciling 'instance', since there is no k8shandler.Reconcile call in the Reconcile of proxyconfig_controller.go...

And one more thing I'm confused about: even without this PR/attempt (adding a Watch for the cluster proxy), changes made to the cluster proxy status are applied to the fluentd EnvVar. That is, it looks to me like we don't need the Watch for the cluster proxy and this PR introduces something redundant. But I must be wrong...

[1]
..............
time="2019-10-23T18:24:59Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:01Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:20Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:20Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:25:39Z" level=info msg="DBG_PX: Proxyconfig reconcile request.Name: 'cluster'"
time="2019-10-23T18:25:39Z" level=info msg="DBG_PX: Proxyconfig reconcile request.Name: 'cluster'"
time="2019-10-23T18:25:50Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:21Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:21Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
time="2019-10-23T18:26:55Z" level=info msg="DBG_CL: Clusterlogging reconcile request.Name: 'instance'"
..............

[2]
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-7e30450910148948e030df8d7b99f153R121
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R132

Contributor:

i think we should leave this function definition the same and instead have a separate call for the proxy config reconciler to just adjust the collector

Contributor:

this feels very hacky

@nhosoi force-pushed the log-464 branch 2 times, most recently from 850178c to b39f488 on October 31, 2019 00:45
@nhosoi removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2019
@nhosoi (Contributor, Author) commented Oct 31, 2019

/test e2e-operator

@nhosoi force-pushed the log-464 branch 4 times, most recently from ea0c6e2 to b1b9d28 on November 18, 2019 23:15
@nhosoi (Contributor, Author) commented Nov 18, 2019

/retest

@nhosoi force-pushed the log-464 branch 3 times, most recently from 74b4eeb to 0dbeaa3 on November 19, 2019 22:53
@ewolinetz (Contributor) commented:

@bparees were your requested changes addressed?
otherwise lgtm

Contributor:

It looks like this is trying to determine whether the reconcile request was triggered by a change to a configmap that logging cares about, but I'm not sure how this works?

Contributor Author:

The above code is "Outdated".
Now we are checking whether request.Name is in ReconcileForGlobalProxyList, which is {"fluentd-trusted-ca-bundle", "kibana-trusted-ca-bundle"}:
} else if utils.ContainsString(constants.ReconcileForGlobalProxyList, request.Name) {
https://github.com/openshift/cluster-logging-operator/pull/255/files#diff-993a3a5d79e9b1a8103e2bde12087a84R108

Contributor:

is this watching all configmaps in the entire cluster? does operatorSDK not give us a way to scope the watch to the logging namespace?

@nhosoi (Contributor, Author) commented Nov 21, 2019:

I assume your question is about the ConfigMap at line 66 (not the Proxy). Now a ConfigMap is watched only if it is in the openshift-logging namespace and its name is fluentd-trusted-ca-bundle or kibana-trusted-ca-bundle.
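
A hedged sketch of what such a scoped watch can look like as a controller-runtime predicate (the namespace and configmap names come from this thread; the predicate and helper names are illustrative):

```go
// Sketch only: restrict the ConfigMap watch to the two trusted CA bundle
// configmaps in the openshift-logging namespace.
package clusterlogging

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func isWatchedCABundle(namespace, name string) bool {
	if namespace != "openshift-logging" {
		return false
	}
	return name == "fluentd-trusted-ca-bundle" || name == "kibana-trusted-ca-bundle"
}

var trustedCABundlePredicate = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		return isWatchedCABundle(e.Meta.GetNamespace(), e.Meta.GetName())
	},
	UpdateFunc: func(e event.UpdateEvent) bool {
		return isWatchedCABundle(e.MetaNew.GetNamespace(), e.MetaNew.GetName())
	},
	DeleteFunc: func(e event.DeleteEvent) bool {
		return isWatchedCABundle(e.Meta.GetNamespace(), e.Meta.GetName())
	},
}
```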

@ewolinetz self-requested a review November 20, 2019 21:35
@nhosoi force-pushed the log-464 branch 2 times, most recently from 364df02 to b411583 on November 21, 2019 01:08
@nhosoi (Contributor, Author) commented Nov 21, 2019

@ewolinetz, @bparees, thank you very much for your reviews.
I squashed the patches. Hopefully, they fix the issues you pointed out...

@nhosoi (Contributor, Author) commented Nov 21, 2019

/test e2e-operator

@richm (Contributor) commented Nov 21, 2019

Please review. We need to merge this by tomorrow, or it won't happen for a couple of weeks. Let's try to get this merged ASAP.

@bparees (Contributor) commented Nov 21, 2019

@richm @nhosoi @ewolinetz my concerns around the event handling and deployment rollout triggering have been addressed.. lgtm.

@nhosoi (Contributor, Author) commented Nov 21, 2019

after https://github.com/openshift/cluster-logging-operator/pull/255/files#r348753401 i'll put a flag on this... @nhosoi

Thanks, @ewolinetz, @bparees!!
I love this auto-commit feature in GitHub! But I'd still better squash into one PR...

- Adding proxyconfig controller to watch cluster proxy and trusted CA
  bundle configmaps in the openshift-logging namespace. These configmaps
  are named KibanaTrustedCAName and FluentdTrustedCAName.
- Adding pkg/constants/constants.go to share the constant strings.
- Simplifying how proxy environment variables are set in EnvVar.
- Adding trusted CA bundle configmap support.
  The configmap is being watched in the proxyconfig controller.
  The fluentd daemonset and the kibana deployment hold a hash of the CA
  certs in an annotation.  The value is updated when the CA certs in the
  configmap are updated, which restarts the fluentd and kibana pods and
  updates the mounted tls-ca-bundle.pem file.  The volume mount overrides
  /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem with the certs
  auto-filled into the configmap.

utils.EnvVarEqual:
- In EnvVarSourceEqual, replacing reflect.DeepEqual with customized
  EnvVarResourceFieldSelectorEqual since Divisor (type resource.Quantity)
  is not correctly compared by DeepEqual.

Others:
hack/common - Keeping debug_print for future debugging.

This PR fixes the following 3 bugs.
Bug 1752725 - Log into kibana console get `504 Gateway Time-out The
              server didn't respond in time. ` when http_proxy enabled
Bug 1766187 - Authentication "500 Internal Error"
Bug 1768762 - Fluentd: "Could not communicate to Elasticsearch" when
              http proxy enabled in the cluster.
              Fix: Setting the elasticsearch FQDN to logStoreService
              and elasticsearchName.  The FQDN belongs to the global
              proxy noProxy list. By doing so, it skips the global
              proxy to communicate with the internal elasticsearch.
Bug 1774837 - Too many `warning: The environment variable HTTP_PROXY is
              discouraged. Use http_proxy.` in fluentd pod logs after
              enable forwarding logs to user-managed ES as insecure
              Fix: In addition to HTTP_PROXY, HTTPS_PROXY and NO_PROXY,
              setting http_proxy, https_proxy and no_proxy, as well.
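
To make the restart mechanism described in the commit message concrete, here is a minimal sketch of the hash-annotation technique (the annotation key, helper names, and the ca-bundle.crt data key are assumptions, not the exact code in this PR):

```go
// Sketch only: hash the trusted CA bundle and stamp it onto the pod template
// as an annotation; when the bundle changes, the hash changes, the template
// changes, and the daemonset/deployment rolls out new pods.
package k8shandler

import (
	"crypto/sha256"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// trustedCAHashAnnotation is an illustrative annotation key.
const trustedCAHashAnnotation = "logging.openshift.io/trusted-ca-bundle-hash"

// calcTrustedCAHash returns a stable hash of the CA bundle stored in the
// trusted CA configmap (key name assumed to be ca-bundle.crt).
func calcTrustedCAHash(cm *corev1.ConfigMap) string {
	sum := sha256.Sum256([]byte(cm.Data["ca-bundle.crt"]))
	return fmt.Sprintf("%x", sum)
}

// setTrustedCAHash records the hash on a pod template so that a changed
// bundle produces a changed template and therefore a rollout.
func setTrustedCAHash(template *corev1.PodTemplateSpec, hash string) {
	if template.Annotations == nil {
		template.Annotations = map[string]string{}
	}
	template.Annotations[trustedCAHashAnnotation] = hash
}
```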
@ewolinetz (Contributor) commented:

/lgtm

@openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2019
@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ewolinetz, nhosoi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit 6a61d90 into openshift:master Nov 21, 2019
@openshift-ci-robot commented:

@nhosoi: All pull requests linked via external trackers have merged. Bugzilla bug 1752725 has been moved to the MODIFIED state.


In response to this:

Bug 1752725: Log into kibana console get 504 Gateway Time-out The server didn't respond in time. when http_proxy enabled

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pmoogi-redhat pushed a commit to pmoogi-redhat/cluster-logging-operator that referenced this pull request Apr 26, 2022
Bug 1752725: Log into kibana console get `504 Gateway Time-out The server didn't respond in time. ` when http_proxy enabled

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
