install/telemetry: add prometheus alerts #1223

Merged

openshift-merge-robot merged 2 commits into openshift:master from kikisdeliveryservice:prom-alerts

Nov 5, 2019

Contributor

kikisdeliveryservice commented Oct 29, 2019

Added alerts for the following metrics:
MCDRebootErr (critical), MCDDrainErr (critical),
MCDPivotErr (warning), KubeletHealthState (warning)
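
For readers unfamiliar with how such an alert is usually declared, here is a minimal sketch of one rule in a `PrometheusRule` manifest. It is not the manifest from this PR: the object name, namespace, group name, alert name, metric name (`mcd_reboot_err`), threshold, and annotation text are all assumptions made for illustration. Only the severity is taken from the list above.

```yaml
# Illustrative sketch only, not the manifest added by this PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-config-daemon                    # assumed object name
  namespace: openshift-machine-config-operator   # assumed namespace
spec:
  groups:
    - name: mcd-reboot-error                     # assumed group name
      rules:
        - alert: MCDRebootError                  # hypothetical alert name
          expr: mcd_reboot_err > 0               # assumed metric name and threshold
          labels:
            severity: critical                   # severity listed for MCDRebootErr above
          annotations:
            # assumed wording; assumes the metric carries a "node" label
            message: "Reboot failed on {{ $labels.node }}"
```

The other three alerts would presumably follow the same shape, with `severity: warning` for MCDPivotErr and KubeletHealthState.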

openshift-ci-robot added do-not-merge/work-in-progress size/S approved labels

openshift-ci-robot requested review from LorbusChris and ericavonb

October 29, 2019 18:58

kikisdeliveryservice force-pushed the prom-alerts branch from 57f7a3d to 2646e1a Compare

October 29, 2019 20:27

kikisdeliveryservice removed request for LorbusChris and ericavonb

October 29, 2019 21:01

Contributor Author

kikisdeliveryservice commented Oct 29, 2019

/skip

openshift-ci-robot added size/M and removed size/S labels

kikisdeliveryservice changed the title ~~WIP: mcd prom alerts~~ WIP: mcd: add prometheus alerts

Contributor Author

kikisdeliveryservice commented Oct 30, 2019

/skip

kikisdeliveryservice force-pushed the prom-alerts branch from 42d9f84 to 5273b7c Compare

October 30, 2019 00:57

Contributor Author

kikisdeliveryservice commented Oct 30, 2019

/skip

Contributor Author

kikisdeliveryservice commented Oct 30, 2019

/skip

Contributor Author

kikisdeliveryservice commented Oct 30, 2019

/retest

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

/retest

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

/retest

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

Weird cluster monitoring errors that I don't think are related to this PR; our metrics went up, and at a quick glance things seem to be in order.

kikisdeliveryservice changed the title ~~WIP: mcd: add prometheus alerts~~ daemon: add prometheus alerts

openshift-ci-robot removed the do-not-merge/work-in-progress label

kikisdeliveryservice force-pushed the prom-alerts branch from 5273b7c to 02bb575 Compare

October 31, 2019 06:46

kikisdeliveryservice changed the title ~~daemon: add prometheus alerts~~ install/telemetry: add prometheus alerts

kikisdeliveryservice requested a review from runcom

October 31, 2019 06:47

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

/retest

Member

runcom commented Oct 31, 2019

/approve
/lgtm

openshift-ci-robot assigned runcom

openshift-ci-robot added the lgtm label

kikisdeliveryservice force-pushed the prom-alerts branch from 02bb575 to 05dfbbe Compare

October 31, 2019 17:53

Member

runcom commented Oct 31, 2019

/skip
/lgtm

openshift-ci-robot added the lgtm label

Contributor

openshift-ci-robot commented Oct 31, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

level=info msg="Cluster operator insights Disabled is False with : "

/test e2e-aws

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

Something is wrong with CI; other PRs hit this too:

    mcd_test.go:549: Created fips-4b24822d-9d97-4a39-8556-2639a634f3a1
    mcd_test.go:119: Pool worker has rendered config fips-4b24822d-9d97-4a39-8556-2639a634f3a1 with rendered-worker-50784165656d847901c65e183761be2f (waited 6.00933476s)
    mcd_test.go:555: pool worker didn't report updated to rendered-worker-50784165656d847901c65e183761be2f: timed out waiting for the condition

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Oct 31, 2019

Reported: "Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.231.162.72:443: connect: connection refused"

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Nov 1, 2019

level=error msg="Error: Error waiting for Deleting Firewall: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 20m0s)" ?!??

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Nov 1, 2019

level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.231.193.34:443: connect: connection refused"

more bad runs :(

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Nov 1, 2019

Again...

level=error
level=error msg="  on ../tmp/openshift-install-140427364/dns/base.tf line 39, in resource \"google_dns_record_set\" \"etcd_a_nodes\":"
level=error msg="  39: resource \"google_dns_record_set\" \"etcd_a_nodes\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
2019/11/01 02:35:54 Container setup in pod e2e-gcp-op failed, exit code 1, reason Error
Another process exited

/test e2e-gcp-op

Member

runcom commented Nov 1, 2019

/retest

Contributor Author

kikisdeliveryservice commented Nov 1, 2019

level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.196.232.189:443: connect: connection refused"

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Nov 1, 2019

Hitting the same error as here: #1232 (comment)

Reported, and not retesting for now.

Contributor

openshift-bot commented Nov 2, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 2, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 3, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 3, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 3, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 4, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 4, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor Author

kikisdeliveryservice commented Nov 4, 2019

/test e2e-gcp-op

Contributor

openshift-bot commented Nov 5, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 5, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 5, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 5, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-bot commented Nov 5, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

Contributor

openshift-ci-robot commented Nov 5, 2019

@kikisdeliveryservice: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | `6dac9f2` | link | `/test e2e-aws-scaleup-rhel7` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor Author

kikisdeliveryservice commented Nov 5, 2019

/test e2e-gcp-op

Contributor Author

kikisdeliveryservice commented Nov 5, 2019

/retest

openshift-merge-robot merged commit 37b9252 into openshift:master

wking mentioned this pull request

Bug 1861876: telemetry: change MCDDrainErr from critical to warning #1959

Merged

Labels

approved lgtm size/M