Add Metrics to the snapshot controller #142

Closed
xing-yang opened this issue Jul 12, 2019 · 25 comments · Fixed by #280 or #409

Comments

@xing-yang
Collaborator

Add metrics to the snapshot controller. These need to be consistent with the metrics in the PV/PVC controller.

@jingxu97 to check if Shawn (?) can work on this.

@yuxiangqian
Contributor

/assign

@yuxiangqian
Contributor

/kind feature

@k8s-ci-robot added the kind/feature label on Jul 30, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 29, 2019
@xing-yang
Collaborator Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Oct 29, 2019
@saad-ali
Member

Fixed in #227

@saad-ali
Member

@xing-yang @yuxiangqian You should open a similar bug on the common snapshot controller.

@saad-ali
Member

Or I guess we can leave this open to track that

@saad-ali reopened this on Dec 31, 2019
@xing-yang
Collaborator Author

Sure.

@yuxiangqian
Contributor

Plan to add the following metrics to the common controller:

  1. End-to-end latency of the following snapshot operations:
    a. creation (including snapshot cutting and uploading time, if any)
    b. deletion
    metric name: snapshot_operation_total_seconds
    metric type: histogram
    labels: driver, operation_name
  2. Error count of operations
    metric name: snapshot_operation_error_count
    metric type: counter
    labels: driver, operation_name
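
For illustration only, a minimal sketch of how the two metrics in this plan might be declared, assuming Go and the prometheus/client_golang library; the bucket boundaries and the RegisterMetrics helper are assumptions, not the controller's actual code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// End-to-end latency of snapshot operations (create, delete), including
	// snapshot cutting and upload time where applicable.
	snapshotOperationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "snapshot_operation_total_seconds",
			Help:    "End-to-end latency of snapshot operations in seconds.",
			Buckets: prometheus.ExponentialBuckets(0.1, 2, 15), // 0.1s .. ~27min, illustrative
		},
		[]string{"driver", "operation_name"},
	)

	// Number of snapshot operation errors.
	snapshotOperationErrorCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "snapshot_operation_error_count",
			Help: "Number of snapshot operation errors.",
		},
		[]string{"driver", "operation_name"},
	)
)

// RegisterMetrics registers both collectors with the given registry.
func RegisterMetrics(registry *prometheus.Registry) {
	registry.MustRegister(snapshotOperationSeconds, snapshotOperationErrorCount)
}
```

Recording a sample would then look like `snapshotOperationSeconds.WithLabelValues(driver, opName).Observe(duration.Seconds())`, and an error like `snapshotOperationErrorCount.WithLabelValues(driver, opName).Inc()`.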

@saad-ali
Member

You can combine both (latency and count) into a single metric.

@msau42
Collaborator

msau42 commented Mar 18, 2020

Count is strange for a total latency metric that you want to capture across retry loops in the controller.

@msau42
Collaborator

msau42 commented Mar 18, 2020

Maybe we can get some guidance from sig-instrumentation in this area. How do we map the K8s reconciliation model to typical metrics patterns, which are generally call-based?

@yuxiangqian
Contributor

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?
One possibility is to model reconciliation as an async-call-based metrics pattern.

@saad-ali
Member

My recommendation is based on this conversation: kubernetes-csi/external-provisioner#386 (comment)
I'll defer to @logicalhan

@yuxiangqian
Contributor

It's certainly possible to save the error counter by introducing a new label into the latency histogram. Unlike the CSI metrics, there is currently no well-defined return code in the snapshot-controller which could serve as another label. To achieve that, a more well-defined error reporting mechanism needs to be introduced in the snapshot-controller. I will do some more digging.
It could be as simple as three types:

  1. Success
  2. Retryable
  3. Permanent-Failure

Since the histogram metric is meant to record the end-to-end latency of an operation, these three types could also be used to determine whether the operation has ended or not.
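
A hedged sketch of that idea in Go with prometheus/client_golang: fold the error information into the latency histogram through an operation_status label and only observe a sample once a terminal status is reached. The constants and the RecordOperation helper are invented for illustration, not taken from the controller:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical operation statuses; only the terminal ones end an operation.
const (
	StatusSuccess          = "Success"
	StatusRetryable        = "Retryable"
	StatusPermanentFailure = "Permanent-Failure"
)

var snapshotOperationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "snapshot_operation_total_seconds",
		Help: "End-to-end latency of snapshot operations in seconds.",
	},
	[]string{"driver", "operation_name", "operation_status"},
)

// RecordOperation observes a latency sample only when the operation reaches a
// terminal status; a retryable error leaves the operation open for the next
// reconcile loop.
func RecordOperation(driver, operation, status string, start time.Time) {
	if status == StatusRetryable {
		return
	}
	snapshotOperationSeconds.
		WithLabelValues(driver, operation, status).
		Observe(time.Since(start).Seconds())
}
```

Since a Prometheus histogram already exposes a per-label-set _count series, permanent failures could then be counted from snapshot_operation_total_seconds_count filtered on operation_status, without a separate error counter.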

@logicalhan

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?

A histogram metric is actually a series of metrics, and it includes a count metric. So having another metric for counting is redundant; this is an artifact of the way the Prometheus exposition format expresses histograms.
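
As an illustration (hypothetical series and values, not output from any real controller), a single histogram is exported as bucket, sum, and count series:

```
# TYPE snapshot_operation_total_seconds histogram
snapshot_operation_total_seconds_bucket{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot",le="1"} 3
snapshot_operation_total_seconds_bucket{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot",le="+Inf"} 7
snapshot_operation_total_seconds_sum{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot"} 42.7
snapshot_operation_total_seconds_count{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot"} 7
```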

@msau42
Collaborator

msau42 commented Mar 20, 2020

@logicalhan @saad-ali the challenge here is that this metric is a total latency metric that captures the time across multiple retry loops, so the "count" is really only the number of Snapshot objects that were created.

How much value do you see in also having a per-retry-loop metric? In reality, I have found that the per-loop metric, at least for latency, has not been very useful because it doesn't map to the user-perceived operations.

@yuxiangqian
Contributor

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?

A histogram metric is actually a series of metrics, and it includes a count metric. So having another metric for counting is redundant; this is an artifact of the way the Prometheus exposition format expresses histograms.

This is slightly different from a pure count: it's an error count, not a total operation count. It also makes the situation more complex, as the end-to-end latency metric tries to record the whole duration of an operation, from when the controller took it to when it either succeeded or permanently failed. An operation might go through multiple reconcile loops before it's done, and the same errors could happen repeatedly. I probably need to add a cache to record starting timestamps for each operation, and mixing the error count into the cache might bring additional complexity, since we do not have well-defined error statuses as CSI does.
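
One possible shape for such a start-time cache, sketched in Go; the key fields and method names are invented for illustration and are not the controller's actual API:

```go
package metrics

import (
	"sync"
	"time"
)

// operationKey identifies one logical end-to-end operation, e.g. a
// CreateSnapshot for a particular VolumeSnapshot UID.
type operationKey struct {
	snapshotUID string
	operation   string
}

// OperationStartTimeCache remembers when an operation was first seen, so the
// latency observed at the end spans all intermediate reconcile retries.
type OperationStartTimeCache struct {
	mu     sync.Mutex
	starts map[operationKey]time.Time
}

func NewOperationStartTimeCache() *OperationStartTimeCache {
	return &OperationStartTimeCache{starts: make(map[operationKey]time.Time)}
}

// Start records the start time only on the first call for a given key, so
// retries of the same operation do not reset the clock.
func (c *OperationStartTimeCache) Start(uid, op string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := operationKey{snapshotUID: uid, operation: op}
	if _, ok := c.starts[k]; !ok {
		c.starts[k] = time.Now()
	}
}

// Finish returns the elapsed time and removes the entry; it would be called
// when the operation reaches a terminal status (success or permanent failure).
func (c *OperationStartTimeCache) Finish(uid, op string) (time.Duration, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := operationKey{snapshotUID: uid, operation: op}
	start, ok := c.starts[k]
	if !ok {
		return 0, false
	}
	delete(c.starts, k)
	return time.Since(start), true
}
```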

@yuxiangqian
Contributor

/reopen

@k8s-ci-robot
Contributor

@yuxiangqian: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on May 26, 2020
@yuxiangqian
Contributor

This is not done yet. #280 only supplies the metrics utility functions; the implementation in the controller is still needed.

@yuxiangqian
Contributor

cc @AndiLi99

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 3, 2020
@xing-yang
Collaborator Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 3, 2020
@xing-yang
Collaborator Author

/assign @ggriffiths
