Skip to content

Conversation

@sgreene570
Copy link
Contributor

Follow up PR to #476 in support of NE-199.

This PR adds logic to periodically test the canary route. Canary probe results are reported very verbosely via the ingress operator's logs, somewhat verbosely via newly declared Prometheus metrics, and very granularly through the default ingress controllers status conditions.

This PR adds the go-httpstat, which provides boiler plate code for measuring HTTP latencies off of go's native httpstat library.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2020
@sgreene570 sgreene570 force-pushed the NE-199-phase-2 branch 2 times, most recently from 14a2028 to 1db6ee0 Compare November 18, 2020 15:32
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 18, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 18, 2020
@sgreene570
Copy link
Contributor Author

/test e2e-aws-operator so I can verify the e2e tests I drafted

@sgreene570
Copy link
Contributor Author

sgreene570 commented Nov 18, 2020

Soliciting initial feedback 🚢
/cc @openshift/openshift-team-network-edge

@openshift-ci-robot openshift-ci-robot requested a review from a team November 18, 2020 19:11
@sgreene570 sgreene570 force-pushed the NE-199-phase-2 branch 2 times, most recently from 4920a28 to fd7eac0 Compare November 18, 2020 21:38
@sgreene570
Copy link
Contributor Author

need to investigate why the new canary e2e test is not passing

@sgreene570 sgreene570 force-pushed the NE-199-phase-2 branch 3 times, most recently from 5af1d60 to c7dfe44 Compare November 19, 2020 21:49
@sgreene570
Copy link
Contributor Author

/retest

@sgreene570 sgreene570 force-pushed the NE-199-phase-2 branch 2 times, most recently from f57b083 to c0085e8 Compare November 20, 2020 17:01
@sgreene570 sgreene570 changed the title [WIP] NE-199 Phase 2: Add periodic canary route HTTP checks w/ metrics & basic status reporting NE-199 Phase 2: Add periodic canary route HTTP checks w/ metrics & basic status reporting Nov 20, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 20, 2020
@sgreene570
Copy link
Contributor Author

0f519d7 updates this PR to account for openshift/origin#25703

@sgreene570
Copy link
Contributor Author

/retest

1 similar comment
@sgreene570
Copy link
Contributor Author

/retest

@sgreene570
Copy link
Contributor Author

/retest

)

var (
CanaryRequestTime = prometheus.NewHistogramVec(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RiRa12621 Hey if you have a second could you glance at these metrics and make sure they look good to you? Asking since we've collaborated on Ingress metrics in the past 😄

@sgreene570
Copy link
Contributor Author

/retest

@Miciah
Copy link
Contributor

Miciah commented Dec 2, 2020

Looks good, but #498 has caused a conflict.

Add tcnksm/go-httpstat to go.mod and vendor/

The go-httpstat dep wraps the go httptrace library
with boiler-plate timing definitions.

Add prometheus/client_golang to go.mod and vendor/

This allows the ingress operator to create it's own
metrics.
Use the stop channel used to control the operator
in cmd/ingress-operator/start.go to control the
Canary metrics and canary route polling loop.
@sgreene570
Copy link
Contributor Author

Looks good, but #498 has caused a conflict.

Fixed! Ready to go 🚢

@sgreene570
Copy link
Contributor Author

CI is failing on VPC limits
/retest

Copy link
Contributor

@frobware frobware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, just one case where we don't handle an err in the tests that needs fixing. The rest observations/comments/suggestions.

pkg/operator/controller/canary:

Add basic HTTP checks for the canary route created in
NE-199 phase 1. Add canary metrics to the canary controller
and populate them with the results of the canary checks.

Add a canary check status condition to the default ingress
controller to mark the ingress operator as degraded
when canary checks are unsucessful for a sustained period
of time.

pkg/operator/controller/canary/http.go: New file to define
the canary route check HTTP funcs using go-httpstat.

pkg/operator/controller/canary/metrics.go: New file to define
the canary route check metrics to be collected via prometheus.
pkg/operator/controller/ingress/controller.go: Define
the canary status condition name.

pkg/operator/controller/ingress/status.go: Mark the
default ingress controller as degraded when the canary
status condition has been set to false for a sustained period
of time.

pkg/operator/controller/ingress/status_test.go: Add test cases
to verify the default ingress controller canary status condition
changes.
Add a test to test/e2e/operator_test.go that verifies the
ingress clusteroperator's status related objects.

Add 2 tests to a new test file, test/e2e/canary_test.go.

The first test verifies that the Ingress canary route and echo pod (hello-openshift)
work as intended.

The second test verifies that the default ingress controller
reports the canary check success status condition after a short period
of time.
`origin-hello-openshift` is now publicly available
on quay.io. Use the publicly available image
in `manifests/image-references` (instead of the outdated
`hello-openshift` image hosted on DockerHub).

Note: This only affects ingress operator development, as the CVO
overrides this reference to use the hello-openshift image in
the release payload anyways.
@Miciah
Copy link
Contributor

Miciah commented Dec 2, 2020

#493 (comment) made me realize, test/e2e/canary_test.go should be able to use kclient, right? If so, we can clean that up in a follow-up.

@sgreene570
Copy link
Contributor Author

#493 (comment) made me realize, test/e2e/canary_test.go should be able to use kclient, right? If so, we can clean that up in a follow-up.

I believe so. I'll make a note to fix that in a follow up.

@Miciah
Copy link
Contributor

Miciah commented Dec 2, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah, sgreene570

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sgreene570
Copy link
Contributor Author

sgreene570 commented Dec 2, 2020

level=error msg=Error: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.
level=error msg=	status code: 400, request id: 2032d3f9-cd40-48be-81d6-a663735959fe

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/493/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-operator/1334192150635614208

@sgreene570
Copy link
Contributor Author

/retest

@sgreene570
Copy link
Contributor Author

/refresh
/retest

@openshift-ci-robot
Copy link
Contributor

@sgreene570: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws 91b9929 link /test e2e-aws
ci/prow/e2e-upgrade 91b9929 link /test e2e-upgrade
ci/prow/e2e-aws-operator 91b9929 link /test e2e-aws-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sgreene570
Copy link
Contributor Author

One more to go!
/test e2e-upgrade

@openshift-merge-robot openshift-merge-robot merged commit d323ac2 into openshift:master Dec 3, 2020
candita pushed a commit to candita/cluster-ingress-operator that referenced this pull request Dec 3, 2020
NE-199 Phase 2: Add periodic canary route HTTP checks w/ metrics & basic status reporting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants