-
Notifications
You must be signed in to change notification settings - Fork 231
[OCPCLOUD-910] Add HistogramVector to track transition into different Machine phases #640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OCPCLOUD-910] Add HistogramVector to track transition into different Machine phases #640
Conversation
|
Skipping CI for Draft Pull Request. |
a5916c5 to
9e5a3e9
Compare
|
i'm +1 for this metric, i think it gives an interesting window into machine creation timings. i would love if we could add a label to track each machine as well, but i have a feel the unbounded cardinality would not be appreciated. |
|
/retest |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@JoelSpeed This is a big step forward for us to narrow down our test failures. Big 👍 from me. |
|
/hold cancel |
elmiko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i still like this, holding back lgtm to let others get a look.
|
/test e2e-azure |
|
@JoelSpeed: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
alexander-demicev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
|
|
||
| // Update the metric after everything else has succeeded to prevent duplicate | ||
| // entries when there are failures | ||
| if phase != phaseDeleting { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should protect against duplicate calls to the same phase so we don't spoil this metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had considered this and thought it to be covered by L#432, we only update the metric on changes to the phase right? Do you think there's some extra nuance or case that needs to be covered here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.
One thing to consider is protecting against empty phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.
Ack, that would be sensible!
One thing to consider is protecting against empty phase.
How do you mean here, as in if it were empty then skipped ahead to eg Running? Wondering when this could happen, only when the status is blanked somehow? Maybe on creation of master machines during IPI? How would you suggest we protect here? Should we only record "" to "provisioning" and ignore "" to anything else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably outside the scope of this patch set. Perhaps we need a validatePhaseTransition() function to ensure we're doing the right thing here.
I'm confident the code will work as-is today, I'm worried about protecting against bugs in the future. We can probably just make another card for this, I think we can ship as-is.
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest Please review the full test history for this PR and help us cut down flakes. |
This PR adds a metric which tracks the phase transitions for Machines as they are transitioning to
Provisioning,Provisioned,RunningorFailed. This will allow us to calculate, for example, what the average creation time for a Machine is and see which phases are particularly slow.Eg, the average time it took for a machine to enter each of these phases:
Since this is a histogram, we should be able to create some interesting graphs in Grafana once this is merged to allow customers to track how long different phases are taking
Potential future work: If we could get the MCS to serve the same metric and set an imaginary phase for
IgnitionFetched, we could also see how long the Machine took to get to fetching ignition config.