@JoelSpeed
Contributor

@JoelSpeed JoelSpeed commented Jul 14, 2020

This PR adds a metric which tracks Machine phase transitions as Machines move to Provisioning, Provisioned, Running or Failed. This will allow us to calculate, for example, the average creation time for a Machine and to see which phases are particularly slow.

For example, the average time it took for a machine to enter each of these phases:

[Screenshot 2020-07-20 at 13 37 11: average time for a machine to enter each phase]

Since this is a histogram, we should be able to create some interesting graphs in Grafana once this is merged, allowing customers to track how long different phases are taking.

Potential future work: If we could get the MCS to serve the same metric and set an imaginary phase for IgnitionFetched, we could also see how long the Machine took to get to fetching ignition config.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 14, 2020
@openshift-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@JoelSpeed JoelSpeed force-pushed the phase-transition-metric branch from a5916c5 to 9e5a3e9 Compare July 20, 2020 12:39
@JoelSpeed JoelSpeed marked this pull request as ready for review July 20, 2020 12:46
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2020
@elmiko
Contributor

elmiko commented Jul 20, 2020

i'm +1 for this metric, i think it gives an interesting window into machine creation timings. i would love it if we could add a label to track each machine as well, but i have a feeling the unbounded cardinality would not be appreciated.
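
A rough illustration of the cardinality concern, using only the standard library. The `transition` type and `seriesCount` function are hypothetical; the sketch assumes that every distinct label combination becomes its own set of time series in the metrics backend.

```go
package main

import "fmt"

// transition is a hypothetical record of one Machine entering one phase.
type transition struct {
	machine string
	phase   string
}

// seriesCount returns the number of distinct label combinations that would
// be produced, with or without a per-machine label.
func seriesCount(transitions []transition, perMachine bool) int {
	seen := map[string]struct{}{}
	for _, t := range transitions {
		key := t.phase
		if perMachine {
			key = t.machine + "/" + t.phase
		}
		seen[key] = struct{}{}
	}
	return len(seen)
}

func main() {
	phases := []string{"Provisioning", "Provisioned", "Running"}
	var transitions []transition
	for i := 0; i < 100; i++ {
		for _, phase := range phases {
			transitions = append(transitions, transition{
				machine: fmt.Sprintf("machine-%d", i),
				phase:   phase,
			})
		}
	}
	fmt.Println("phase label only:", seriesCount(transitions, false))
	fmt.Println("phase and machine labels:", seriesCount(transitions, true))
}
```

With a phase label alone the count stays fixed at the number of phases; with a machine label it grows with every Machine ever created, which is the unbounded growth mentioned above.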

@JoelSpeed
Contributor Author

/retest

@enxebre
Member

enxebre commented Sep 8, 2020

/retest
/approve
/hold for 4.7

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2020
@kwoodson

@JoelSpeed This is a big step forward for us to narrow down our test failures. Big 👍 from me.

@JoelSpeed
Contributor Author

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2020
Contributor

@elmiko elmiko left a comment

i still like this, holding back lgtm to let others get a look.

@JoelSpeed
Contributor Author

JoelSpeed commented Oct 14, 2020

/test e2e-azure
/test e2e-aws
/test e2e-gcp

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 14, 2020

@JoelSpeed: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp | 9e5a3e9 | link | /test e2e-gcp |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

@alexander-demicev alexander-demicev left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020

// Update the metric after everything else has succeeded to prevent duplicate
// entries when there are failures
if phase != phaseDeleting {
Contributor

We should protect against duplicate calls to the same phase so we don't spoil this metric.

Contributor Author

I had considered this and thought it was covered by L#432: we only update the metric on changes to the phase, right? Do you think there's some extra nuance or case that needs to be covered here?

Contributor

Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.

One thing to consider is protecting against empty phase.

Contributor Author

> Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.

Ack, that would be sensible!

> One thing to consider is protecting against empty phase.

How do you mean here? As in, if the phase were empty and then skipped ahead to, e.g., Running? I'm wondering when this could happen: only when the status is blanked somehow? Maybe on creation of master machines during IPI? How would you suggest we protect against it? Should we only record "" to "provisioning" and ignore "" to anything else?

Contributor

Yeah, probably outside the scope of this patch set. Perhaps we need a validatePhaseTransition() function to ensure we're doing the right thing here.

I'm confident the code will work as-is today; I'm worried about protecting against bugs in the future. We can probably just make another card for this. I think we can ship as-is.
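
As a starting point for that follow-up card, a `validatePhaseTransition()` helper could be sketched as below. No such function exists in this PR; the rules shown are assumptions drawn only from this thread (reject an empty target phase, reject a repeated phase, and allow an empty phase to move only to Provisioning), not a complete state machine.

```go
package main

import (
	"errors"
	"fmt"
)

const phaseProvisioning = "Provisioning"

// validatePhaseTransition is a hypothetical helper that reports whether a
// phase change should be recorded. The rules are illustrative assumptions.
func validatePhaseTransition(from, to string) error {
	switch {
	case to == "":
		return errors.New("target phase must not be empty")
	case from == to:
		return fmt.Errorf("phase is already %q", to)
	case from == "" && to != phaseProvisioning:
		return fmt.Errorf("unexpected transition from empty phase to %q", to)
	}
	return nil
}

func main() {
	fmt.Println(validatePhaseTransition("", "Provisioning"))
	fmt.Println(validatePhaseTransition("", "Running"))
	fmt.Println(validatePhaseTransition("Provisioning", "Provisioned"))
}
```

Returning an error rather than a bool lets the caller log why a transition was skipped, which may help when chasing future bugs of the kind described above.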

@JoelSpeed
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit f4de87c into openshift:master Oct 19, 2020
@JoelSpeed JoelSpeed deleted the phase-transition-metric branch October 20, 2020 09:52