Add Metrics to the snapshot controller #142

Closed
xing-yang opened this issue Jul 12, 2019 · 25 comments · Fixed by #280 or #409

Comments

@xing-yang
Collaborator

Add metrics to the snapshot controller. These need to be consistent with the metrics in the PV/PVC controller.

@jingxu97 to check if Shawn (?) can work on this.

@yuxiangqian
Contributor

/assign

@yuxiangqian
Contributor

/kind feature

@k8s-ci-robot added the kind/feature label on Jul 30, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 29, 2019
@xing-yang
Collaborator Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Oct 29, 2019
@saad-ali
Member

Fixed in #227

@saad-ali
Member

@xing-yang @yuxiangqian You should open a similar bug on the common snapshot controller.

@saad-ali
Member

Or I guess we can leave this open to track that

@saad-ali reopened this on Dec 31, 2019
@xing-yang
Collaborator Author

Sure.

@yuxiangqian
Contributor

Plan to add the following metrics to the common controller:

  1. End-to-end latency of the following snapshot operations:
    a. creation (including snapshot cutting and uploading time, if any)
    b. deletion
    metric name: snapshot_operation_total_seconds
    metric type: histogram
    labels: driver, operation_name
  2. Error count of operations
    metric name: snapshot_operation_error_count
    metric type: counter
    labels: driver, operation_name
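
For illustration only, a minimal sketch of how the two metrics in this plan might be declared, assuming Go and the prometheus/client_golang library; the bucket boundaries and the RegisterMetrics helper are assumptions, not the controller's actual code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// End-to-end latency of snapshot operations (create, delete), including
	// snapshot cutting and upload time where applicable.
	snapshotOperationSeconds = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "snapshot_operation_total_seconds",
			Help:    "End-to-end latency of snapshot operations in seconds.",
			Buckets: prometheus.ExponentialBuckets(0.1, 2, 15), // 0.1s .. ~27min, illustrative
		},
		[]string{"driver", "operation_name"},
	)

	// Number of snapshot operation errors.
	snapshotOperationErrorCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "snapshot_operation_error_count",
			Help: "Number of snapshot operation errors.",
		},
		[]string{"driver", "operation_name"},
	)
)

// RegisterMetrics registers both collectors with the given registry.
func RegisterMetrics(registry *prometheus.Registry) {
	registry.MustRegister(snapshotOperationSeconds, snapshotOperationErrorCount)
}
```

Recording a sample would then look like `snapshotOperationSeconds.WithLabelValues(driver, opName).Observe(duration.Seconds())`, and an error like `snapshotOperationErrorCount.WithLabelValues(driver, opName).Inc()`.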

@saad-ali
Member

You can combine both (latency and count) into a single metric.

@msau42
Collaborator

msau42 commented Mar 18, 2020

Count is strange for a total latency metric that you want to capture across retry loops in the controller.

@msau42
Collaborator

msau42 commented Mar 18, 2020

Maybe we can get some guidance from sig-instrumentation in this area. How do we map the K8s reconciliation model to typical metrics patterns, which are generally call-based?

@yuxiangqian
Contributor

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?
One possibility is to model reconciliation as an async-call-based metrics pattern.

@saad-ali
Member

My recommendation is based on this conversation: kubernetes-csi/external-provisioner#386 (comment)
I'll defer to @logicalhan

@yuxiangqian
Contributor

It's certainly possible to save the error counter by introducing a new label into the latency histogram. Unlike the CSI metrics, there is currently no well-defined return code in the snapshot-controller which could serve as another label. To achieve that, a more well-defined error reporting mechanism needs to be introduced in the snapshot-controller. I will do some more digging.
It could be as simple as three types:

  1. Success
  2. Retryable
  3. Permanent-Failure

Since the histogram metric is meant to record the end-to-end latency of an operation, these three types could also be used to determine whether the operation has ended or not.
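
A hedged sketch of that idea in Go with prometheus/client_golang: fold the error information into the latency histogram through an operation_status label and only observe a sample once a terminal status is reached. The constants and the RecordOperation helper are invented for illustration, not taken from the controller:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical operation statuses; only the terminal ones end an operation.
const (
	StatusSuccess          = "Success"
	StatusRetryable        = "Retryable"
	StatusPermanentFailure = "Permanent-Failure"
)

var snapshotOperationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "snapshot_operation_total_seconds",
		Help: "End-to-end latency of snapshot operations in seconds.",
	},
	[]string{"driver", "operation_name", "operation_status"},
)

// RecordOperation observes a latency sample only when the operation reaches a
// terminal status; a retryable error leaves the operation open for the next
// reconcile loop.
func RecordOperation(driver, operation, status string, start time.Time) {
	if status == StatusRetryable {
		return
	}
	snapshotOperationSeconds.
		WithLabelValues(driver, operation, status).
		Observe(time.Since(start).Seconds())
}
```

Since a Prometheus histogram already exposes a per-label-set _count series, permanent failures could then be counted from snapshot_operation_total_seconds_count filtered on operation_status, without a separate error counter.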

@logicalhan

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?

A histogram metric is actually a series of metrics, and it includes a count metric. So having another metric for counting is redundant; this is an artifact of the way the Prometheus exposition format expresses histograms.
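
As an illustration (hypothetical series and values, not output from any real controller), a single histogram is exported as bucket, sum, and count series:

```
# TYPE snapshot_operation_total_seconds histogram
snapshot_operation_total_seconds_bucket{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot",le="1"} 3
snapshot_operation_total_seconds_bucket{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot",le="+Inf"} 7
snapshot_operation_total_seconds_sum{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot"} 42.7
snapshot_operation_total_seconds_count{driver="hostpath.csi.k8s.io",operation_name="CreateSnapshot"} 7
```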

@msau42
Collaborator

msau42 commented Mar 20, 2020

@logicalhan @saad-ali the challenge here is that this metric is a total latency metric that captures the time across multiple retry loops, so the "count" is really only the number of Snapshot objects that were created.

How much value do you see in also having a per-retry-loop metric? In reality, I have found that the per-loop metric, at least for latency, has not been very useful because it doesn't map to the user-perceived operations.

@yuxiangqian
Contributor

I share the same concern as Michelle, but @saad-ali what's the major benefit of combining them into one? One less time series?

A histogram metric is actually a series of metrics, and it includes a count metric. So having another metric for counting is redundant; this is an artifact of the way the Prometheus exposition format expresses histograms.

This is slightly different from a pure count: it's an error count, not a total operation count. It also makes the situation more complex, as the end-to-end latency metric tries to record the whole duration of an operation, from when the controller took it to when it either succeeded or permanently failed. An operation might go through multiple reconcile loops before it's done, and the same errors could happen repeatedly. I probably need to add a cache to record starting timestamps for each operation, and mixing the error count into the cache might bring additional complexity, since we do not have well-defined error statuses as CSI does.
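
One possible shape for such a start-time cache, sketched in Go; the key fields and method names are invented for illustration and are not the controller's actual API:

```go
package metrics

import (
	"sync"
	"time"
)

// operationKey identifies one logical end-to-end operation, e.g. a
// CreateSnapshot for a particular VolumeSnapshot UID.
type operationKey struct {
	snapshotUID string
	operation   string
}

// OperationStartTimeCache remembers when an operation was first seen, so the
// latency observed at the end spans all intermediate reconcile retries.
type OperationStartTimeCache struct {
	mu     sync.Mutex
	starts map[operationKey]time.Time
}

func NewOperationStartTimeCache() *OperationStartTimeCache {
	return &OperationStartTimeCache{starts: make(map[operationKey]time.Time)}
}

// Start records the start time only on the first call for a given key, so
// retries of the same operation do not reset the clock.
func (c *OperationStartTimeCache) Start(uid, op string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := operationKey{snapshotUID: uid, operation: op}
	if _, ok := c.starts[k]; !ok {
		c.starts[k] = time.Now()
	}
}

// Finish returns the elapsed time and removes the entry; it would be called
// when the operation reaches a terminal status (success or permanent failure).
func (c *OperationStartTimeCache) Finish(uid, op string) (time.Duration, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := operationKey{snapshotUID: uid, operation: op}
	start, ok := c.starts[k]
	if !ok {
		return 0, false
	}
	delete(c.starts, k)
	return time.Since(start), true
}
```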

@yuxiangqian
Contributor

/reopen

@k8s-ci-robot
Contributor

@yuxiangqian: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this on May 26, 2020
@yuxiangqian
Contributor

This is not done yet. #280 only supplies the metrics utility functions; the implementation in the controller is still needed.

@yuxiangqian
Contributor

cc @AndiLi99

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 3, 2020
@xing-yang
Collaborator Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 3, 2020
@xing-yang
Collaborator Author

/assign @ggriffiths
