@kikisdeliveryservice
Contributor

@kikisdeliveryservice kikisdeliveryservice commented Oct 29, 2019

Added alerts for the following metrics:
MCDRebootErr (critical), MCDDrainErr (critical),
MCDPivotErr (warning), KubeletHealthState (warning)
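
For illustration, one of these alerts might be expressed as a PrometheusRule roughly like the sketch below. The rule name, namespace, metric name (`mcd_reboot_err`), threshold, and message are all assumptions made for the example, not the contents of this PR.

```yaml
# Hypothetical sketch only: names, expression, and message are assumed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: machine-config-daemon                    # assumed rule name
  namespace: openshift-machine-config-operator   # assumed namespace
spec:
  groups:
    - name: mcd-reboot-error
      rules:
        - alert: MCDRebootError
          expr: mcd_reboot_err > 0               # assumed metric name and threshold
          labels:
            severity: critical                   # severity as listed above for MCDRebootErr
          annotations:
            message: "Reboot failed on {{ $labels.node }}"
```

The warning-level alerts would presumably follow the same shape with `severity: warning`.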

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 29, 2019
@kikisdeliveryservice
Contributor Author

/skip

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 29, 2019
@kikisdeliveryservice kikisdeliveryservice changed the title WIP: mcd prom alerts WIP: mcd: add prometheus alerts Oct 30, 2019
@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/skip

1 similar comment
@kikisdeliveryservice
Contributor Author

/skip

@kikisdeliveryservice
Contributor Author

/retest

2 similar comments
@kikisdeliveryservice
Contributor Author

/retest

@kikisdeliveryservice
Contributor Author

/retest

@kikisdeliveryservice
Contributor Author

kikisdeliveryservice commented Oct 31, 2019

Weird cluster monitoring errors that I don't think are related to this PR; our metrics went up, and at a quick glance things seem to be in order.

@kikisdeliveryservice kikisdeliveryservice changed the title WIP: mcd: add prometheus alerts daemon: add prometheus alerts Oct 31, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 31, 2019
@kikisdeliveryservice kikisdeliveryservice changed the title daemon: add prometheus alerts install/telemetry: add prometheus alerts Oct 31, 2019
@kikisdeliveryservice
Contributor Author

/retest

@runcom
Member

runcom commented Oct 31, 2019

/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2019
@runcom
Member

runcom commented Oct 31, 2019

/skip
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kikisdeliveryservice, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice
Contributor Author

level=info msg="Cluster operator insights Disabled is False with : " 

/test e2e-aws

@kikisdeliveryservice
Contributor Author

Something is wrong with CI; other PRs hit this too:

    mcd_test.go:549: Created fips-4b24822d-9d97-4a39-8556-2639a634f3a1
    mcd_test.go:119: Pool worker has rendered config fips-4b24822d-9d97-4a39-8556-2639a634f3a1 with rendered-worker-50784165656d847901c65e183761be2f (waited 6.00933476s)
    mcd_test.go:555: pool worker didn't report updated to rendered-worker-50784165656d847901c65e183761be2f: timed out waiting for the condition

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

Reported: "Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.231.162.72:443: connect: connection refused"

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

level=error msg="Error: Error waiting for Deleting Firewall: timeout while waiting for state to become 'DONE' (last state: 'RUNNING', timeout: 20m0s)" ?!??

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.231.193.34:443: connect: connection refused"

More bad runs :(

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

Again...

level=error
level=error msg="  on ../tmp/openshift-install-140427364/dns/base.tf line 39, in resource \"google_dns_record_set\" \"etcd_a_nodes\":"
level=error msg="  39: resource \"google_dns_record_set\" \"etcd_a_nodes\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
2019/11/01 02:35:54 Container setup in pod e2e-gcp-op failed, exit code 1, reason Error
Another process exited 

/test e2e-gcp-op

@runcom
Member

runcom commented Nov 1, 2019

/retest

@kikisdeliveryservice
Contributor Author

level=error msg="Cluster operator authentication Degraded is True with RouteHealthDegradedFailedGet: RouteHealthDegraded: failed to GET route: dial tcp 35.196.232.189:443: connect: connection refused"

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

Hitting the same error as here: #1232 (comment)

Reported, and not retesting for now.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@kikisdeliveryservice
Contributor Author

/test e2e-gcp-op

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments

@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 5, 2019

@kikisdeliveryservice: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | 6dac9f2 | link | /test e2e-aws-scaleup-rhel7 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Contributor Author

/test e2e-gcp-op

@kikisdeliveryservice
Contributor Author

/retest
