Skip to content

Conversation

@Miciah
Copy link
Contributor

@Miciah Miciah commented Sep 1, 2020

Bump openshift/api for ELB connection idle timeout

Bump to github.com/openshift/api@b25f69a603a76ccc809f986c9f5811f0825febbb to get the AWS ELB connection idle timeout API.

  • go.mod: Update.
  • go.sum:
  • manifests/00-custom-resource-definition.yaml:
  • pkg/manifests/bindata.go:
  • vendor/github.com/openshift/api/*:
  • vendor/modules.txt: Regenerate.

test/e2e: import API errors package as apierrors

Import the apimachinery errors package as "apierrors".

  • test/e2e/operator_test.go: Import the "k8s.io/apimachinery/pkg/api/errors" package as "apierrors" so as not to conflict with the standard "errors" package.

desiredLoadBalancerService: Simplify with "lb" var

  • pkg/operator/controller/ingress/load_balancer_service.go (desiredLoadBalancerService): Introduce an "lb" variable to shorten some long lines.

desiredLoadBalancerService: Check for nil LB status

  • pkg/operator/controller/ingress/load_balancer_service.go (desiredLoadBalancerService): Add a nil check just in case status.endpointPublishingStrategy.LoadBalancer is nil somehow.

setDefaultPublishingStrategy: Rework GCP logic

Rework the GCP load-balancer provider parameters defaulting logic in preparation for the next change. Besides simplifying the logic, this change also changes the defaulting logic to ignore unknown provider parameters and to ignore provider parameters for platforms other than the actual platform. These changes should avoid surprises when more provider parameters are added to the API later on as well as prevent weird behavior when the user sets GCP provider parameters on non-GCP clusters.

  • pkg/operator/controller/ingress/controller.go (setDefaultPublishingStrategy): Rework defaulting logic for GCP load-balancer provider parameters.

Allow configuring ELB connection idle timeout

  • pkg/operator/controller/ingress/controller.go (setDefaultPublishingStrategy): Handle changes to the connection idle timeout for an AWS ELB.
  • pkg/operator/controller/ingress/controller_test.go (TestSetDefaultPublishingStrategyHandlesUpdates): Add test cases for changing the ELB connection idle timeout.
  • pkg/operator/controller/ingress/load_balancer_service.go (awsELBConnectionIdleTimeoutAnnotation): New constant.
    (managedLoadBalancerServiceAnnotations): Add awsELBConnectionIdleTimeoutAnnotation.
    (desiredLoadBalancerService): Set the connection idle timeout annotation if the ingresscontroller specifies a non-nil connectionIdleTimeout value.
  • pkg/operator/controller/ingress/load_balancer_service_test.go (TestDesiredLoadBalancerServiceAWSIdleTimeout): New test.
  • test/e2e/operator_test.go (TestAWSELBConnectionIdleTimeout): New test.
  • test/e2e/util.go (buildSlowHTTPDPod): New helper for TestAWSELBConnectionIdleTimeout.

Related to openshift/enhancements#461.

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2020
@Miciah Miciah changed the title Allow configuring ELB connection idle timeout WIP: Allow configuring ELB connection idle timeout Sep 1, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2020
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch 2 times, most recently from 04c7846 to ff7701e Compare September 2, 2020 20:08
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 2, 2020
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from ff7701e to 5c5286b Compare September 2, 2020 21:03
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 2, 2020
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Sep 2, 2020

@Miciah: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws 5c5286b link /test e2e-aws
ci/prow/e2e-aws-operator 5c5286b link /test e2e-aws-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Copy link
Contributor

@Miciah: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-upgrade 5c5286b link /test e2e-upgrade

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2021
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 19, 2021
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot
Copy link
Contributor

@openshift-bot: Closed this PR.

Details

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Miciah
Copy link
Contributor Author

Miciah commented Apr 7, 2022

/reopen
/remove-lifecycle rotten

@openshift-ci openshift-ci bot reopened this Apr 7, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 7, 2022

@Miciah: Reopened this PR.

Details

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 7, 2022
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from 5c5286b to d024778 Compare April 7, 2022 06:31
@Miciah
Copy link
Contributor Author

Miciah commented Apr 7, 2022

In the last e2e-aws-operator run, the new TestAWSELBConnectionIdleTimeout passed like a champ, but TestHostNetworkEndpointPublishingStrategy failed:

 === RUN   TestHostNetworkEndpointPublishingStrategy
    operator_test.go:758: failed to observe expected conditions: timed out waiting for the condition
    operator_test.go:760: deleted ingresscontroller host
--- FAIL: TestHostNetworkEndpointPublishingStrategy (300.03s) 

That test has been failing on a few PRs over the past 26 hours.

/retest

@Miciah
Copy link
Contributor Author

Miciah commented Apr 7, 2022

In the last e2e-aws-operator run, TestConfigurableRouteRBAC, TestConfigurableRouteNoSecretNoRBAC, TestConfigurableRouteNoConsumingUserNoRBAC, TestHostNetworkEndpointPublishingStrategy, and TestAWSELBConnectionIdleTimeout failed. The last failed with the following error message:

     operator_test.go:2208: failed to get ingresscontroller: Get "https://api.ci-op-1yr8trkg-43abb.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/test-idle-timeout": http2: client connection lost 

I could add some retry logic around the Get call that failed, but the E2E tests have a lot of Get calls without retry logic, and the other failures in the test run lead me to think that the cluster had some other issue unrelated to this PR.

/retest

@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from d024778 to 70f8c6b Compare April 7, 2022 22:02
@Miciah
Copy link
Contributor Author

Miciah commented Apr 7, 2022

Latest push adds a unit test that I had forgotten to commit.

Rework the GCP load-balancer provider parameters defaulting logic in
preparation for an upcoming commit.  Besides simplifying the logic, this
commit also changes the defaulting logic to ignore unknown provider
parameters and to ignore provider parameters for platforms other than the
actual platform.  These changes should avoid surprises when more provider
parameters are added to the API later on as well as prevent weird behavior
when the user sets GCP provider parameters on non-GCP clusters.

* pkg/operator/controller/ingress/controller.go
(setDefaultPublishingStrategy): Rework defaulting logic for GCP
load-balancer provider parameters.
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from 90d4ea4 to 1a310ef Compare May 10, 2022 13:10
@Miciah
Copy link
Contributor Author

Miciah commented May 10, 2022

Rebased for #735.

@Miciah
Copy link
Contributor Author

Miciah commented May 16, 2022

/assign frobware
/assign gcs278

Copy link
Contributor

@frobware frobware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some places in the e2e test that really should call t.Fatal() and not t.Error(). Plus a question regarding commit - does this belong to this PR; the commit message mentions it is for an upcoming change.


return true, nil
}); err != nil {
t.Errorf("failed to observe expected condition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Fatal()? Can we carry on without lookup succeeding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it wouldn't hurt to make this Fatal.


return false, nil
}); err != nil {
t.Errorf("failed to observe expected condition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fatal too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it potentially useful to know whether a large value causes the expected behavior when diagnosing why a low value does not?


return true, nil
}); err != nil {
t.Errorf("failed to observe expected condition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fatal.


return true, nil
}); err != nil {
t.Errorf("failed to observe expected condition: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fatal. Although it's at the end of the test we may as well be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're suggesting making every Error into a Fatal. What's an example of where you would advise using Error?

We should use Fatal if any subsequent testing can only produce garbage. However, I think in some of these instances where I used Error in this test, there could be value in continuing the test. For example, even if setting a 3-second timeout didn't behave as expected, we can still try a 120-second timeout to see whether it behaves as expected, and the result could be helpful in diagnosing why the 3-second timeout didn't behave as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're suggesting making every Error into a Fatal. What's an example of where you would advise using Error?

Table-driven unit tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use Fatal if any subsequent testing can only produce garbage

Isn't that the case for all these e2e tests; we try to setup up the cluster/objects/resources/state in a very particular way and if that doesn't happen is it worth generating cascading failures messages if the test was to continue? If you were to debug a failing test you're likely to start with the first error message. Would further error messages help?

if err := wait.PollImmediate(1*time.Second, 5*time.Minute, func() (bool, error) {
_, err := net.LookupIP(route.Spec.Host)
if err != nil {
t.Log(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be really chatty, particularly at 1s interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's test output, it should be under 300 lines, and it's hidden unless a test fails. Is that too chatty? I could add some logic to suppress the log message if it's identical to the previous one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just bump the interval to 3s?

On my sample size of 1 run, the lookup resolved in ~60s.

Perhaps I'm just a little wary if this becomes a parallel test candidate [1]. One downside of running some or a lot of the tests in parallel is the interleaved test output.

Not a blocker for me on the PR though. Was just an observation.

[1] PR #756.

@@ -412,19 +412,63 @@ func setDefaultPublishingStrategy(ic *operatorv1.IngressController, infraConfig
changed = true
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does commit "rework GCP logic" belong to this PR? Can it go in the upcoming commit?

@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from 1a310ef to fc7c323 Compare May 17, 2022 22:55
@Miciah
Copy link
Contributor Author

Miciah commented May 17, 2022

I've changed Error to Fatal in the E2E test.

@frobware
Copy link
Contributor

/lgtm

@frobware
Copy link
Contributor

/retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2022
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from fc7c323 to eee2928 Compare May 18, 2022 15:47
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label May 18, 2022
* pkg/operator/controller/ingress/controller.go
(setDefaultPublishingStrategy): Handle changes to the connection idle
timeout for an AWS ELB.
* pkg/operator/controller/ingress/controller_test.go
(TestSetDefaultPublishingStrategyHandlesUpdates): Add test cases for
changing the ELB connection idle timeout.
* pkg/operator/controller/ingress/load_balancer_service.go
(awsELBConnectionIdleTimeoutAnnotation): New constant.
(managedLoadBalancerServiceAnnotations): Add
awsELBConnectionIdleTimeoutAnnotation.
(desiredLoadBalancerService): Set the connection idle timeout annotation if
the ingresscontroller specifies a non-nil connectionIdleTimeout value.
* pkg/operator/controller/ingress/load_balancer_service_test.go
(TestDesiredLoadBalancerServiceAWSIdleTimeout): New test.
(TestLoadBalancerServiceChanged): Add a test case for the
connection-idle-timeout annotation.
* test/e2e/operator_test.go (TestAWSELBConnectionIdleTimeout): New test.
* test/e2e/util.go (buildSlowHTTPDPod): New helper for
TestAWSELBConnectionIdleTimeout.
@Miciah Miciah force-pushed the allow-configuring-ELB-connection-idle-timeout branch from eee2928 to a635566 Compare May 19, 2022 18:30
@Miciah
Copy link
Contributor Author

Miciah commented May 23, 2022

/retest

@frobware
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 23, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 23, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 23, 2022

@Miciah: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. docs-approved Signifies that Docs has signed off on this PR lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants