
Conversation

@deads2k (Contributor) commented Apr 1, 2020

This helps diagnose CI test flakes that may be caused by operator problems, and it re-uses the existing, valuable visualization for the run.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Apr 1, 2020

When debugging a failed e2e test (or a string of them), one common question is, "what is the status of clusteroperator/foo
when this particular test was running".
While we could consider one-off solutions to this, we have a solution for storing this information inside of a local

@stevekuznetsov commented:

Could you link to the tools mentioned here? This document as written is so vague as to not be understandable :|

@deads2k (Contributor, Author) replied:

> Could you link to the tools mentioned here? This document as written is so vague as to not be understandable :|

It captured the idea for @damemi, who I think has found the tool and has a PR to add it.

@damemi commented Apr 9, 2020


@stevekuznetsov the tool being referred to is https://github.com/mfojtik/ci-monitor-operator and we are working on adding it in openshift/origin#24845

This has inspired a longer-term goal for me to add distributed tracing throughout our components.
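
For readers following along: the core idea is just to record a timestamped snapshot of every ClusterOperator's conditions at a regular interval so a failed e2e test can be lined up against operator state afterwards. Below is a minimal sketch of that approach; the kubeconfig wiring, 30-second poll interval, and log format are illustrative assumptions, not the actual ci-monitor-operator implementation.

```go
// Illustrative sketch only: periodically snapshot ClusterOperator
// conditions so a failed e2e run can be correlated with operator
// state over time. Poll interval and output format are assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: running out-of-cluster against the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	for {
		operators, err := client.ConfigV1().ClusterOperators().List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			fmt.Printf("%s list failed: %v\n", time.Now().Format(time.RFC3339), err)
		} else {
			for _, co := range operators.Items {
				for _, cond := range co.Status.Conditions {
					// One timestamped line per condition; grep-able after the run.
					fmt.Printf("%s %s %s=%s: %s\n",
						time.Now().Format(time.RFC3339), co.Name, cond.Type, cond.Status, cond.Message)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```

With output like this collected for the whole run, answering "what was clusteroperator/foo doing while this test ran" becomes a grep over a time window.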


### Goals

1. Know the state of clusteroperators, events, and pods at any given time.
@lilic (Contributor) commented May 19, 2020


> Know the state of clusteroperators, events, and pods at any given time

What kind of state do you need to know? I am curious which metrics are missing that should be sent from CI clusters, since we plan to add the ability to search through CI cluster metrics in the near future. Would it be useful to connect the different traces, the metrics we send, and what you are proposing here?


## Proposal

1. Install Michal's tool in every cluster
@lilic (Contributor) commented May 19, 2020


Currently we plan to enable Prometheus remote write for CI clusters, sending some metrics and pending alerts to a store where they can be queried along a timeline. I would love your feedback on which metrics should be included in the first batch we send out, thanks!

https://docs.google.com/document/d/1_ILVUYNBC07EHaIlqel9EL1UCWLQlKlMJtTz2Xq9Tmo/edit
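
For context, OpenShift's cluster monitoring stack reads remote-write settings from the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. A minimal sketch follows; the receiver URL is a placeholder assumption:

```yaml
# Sketch of enabling Prometheus remote write via the cluster
# monitoring ConfigMap. The receiving endpoint is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "https://metrics-receiver.example.com/api/v1/write"
```

Given this enhancement's focus, one natural candidate for the first batch would presumably be the cluster_operator_conditions series, since it encodes the same per-operator condition state discussed above.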

approvers:
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
A repository member commented:

You need to pick values for these headers, right? Or should we just drop them from the template?

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Oct 15, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label Nov 14, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.


In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
