feat: Add basic metric functionality #971

shahzadlone · 2022-12-16T14:56:16Z

Relevant issue(s)

Resolves #970
Related Epic: #293

Description

This PR makes use of open telemetry to setup manual instrumentation features for DefraDB.

Note:

The Metric interface is there to wrap up common Metric functionality that would be easy to replace in the future (I am not sure if it's worth maintaining this abstraction in the future as some functionality can not be abstracted and will fundamentally be different across different metric implementations).
With the exception of metricdata no other SDK functionality is leaked outside the metric package.
Async functionality is not wrapped yet (will add on as needed basis).
There is also a global MeterProvider functionality which might be good to look into for when we do observability of non-execution request related things.
The configuration option to explicitly turn metrics off and on is not done here as currently, it is somewhat dormant code wrt defra, and the immediate work for execution explain will always require metrics.

Future work:

While implementing this I tried to keep the implementation of the execution flow in mind, seems like the basic instruments of telemetry should suffice, however, because we will be restructuring metrics into a plangraph-like dag of children and parent metrics from different nodes, we might need to flatten (or restructure) it in the future depending on the type of exporter we use (Prometheus, Jaeger etc.)
UpDownCounter

Question:

Any strong opinions on the location of the metric package?

Tasks

I made sure the code is well commented, particularly hard-to-understand areas.
I made sure the repository-held documentation is changed accordingly.
I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Unit test and CI.

Specify the platform(s) on which this was tested:

Manjaro WSL2

codecov · 2022-12-16T15:05:26Z

Codecov Report

Merging #971 (35a9a5e) into develop (ff13743) will decrease coverage by 0.02%.
The diff coverage is 65.78%.

@@             Coverage Diff             @@
##           develop     #971      +/-   ##
===========================================
- Coverage    57.76%   57.73%   -0.03%     
===========================================
  Files          170      171       +1     
  Lines        19440    19478      +38     
===========================================
+ Hits         11229    11246      +17     
- Misses        7223     7239      +16     
- Partials       988      993       +5

Impacted Files	Coverage Δ
metric/metric.go	`65.78% <65.78%> (ø)`
net/client.go	`78.04% <0.00%> (-4.88%)`	⬇️
net/peer.go	`44.01% <0.00%> (-2.16%)`	⬇️
net/pb/net.pb.go	`15.66% <0.00%> (+0.16%)`	⬆️

AndrewSisley

It looks good to me, but I am only guessing as to how it would actually be used in Defra - it would be very handy if you could hook it up to some production code in this PR so that I/we can properly see this.

metric/metric.go

AndrewSisley · 2022-12-16T15:48:10Z

metric/metric.go

+func (m *Meter) GetSyncHistogram(
+	name string,
+	unit unit.Unit,
+) (syncint64.Histogram, error) {


thought: Reading this PR it looks like syncint64.Histogram is the main type that will be passed around throughout Defra - e.g. held by the planner.countNode object? If so it looks like it would be worth introducing an interface for it, so we dont have hundreds of references to syncint64.Histogram throughout our codebase

I thought about this point too, in the next work (using this in execution explain) I was thinking about narrowing down all the required operations and then just providing those methods such that syncint64.xyz resource held by the Meter or Metric abstraction and then it will be operated on instead of returning, which would avoid this pollution (so Meter would be held by the planner.countNode for example). Please do call me out if the execution PR misses this.

in the next work (using this in execution explain)

I held off asking about this - your plan is to use it in that next? Before hooking up metrics for metrics sake/directly in Defra?

everything else :)

Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)

your plan is to use it in that next? Before hooking up metrics for metrics sake/directly in Defra?

Yes, plan is to use it in execution explain next (for which would need to be used by Defra, i.e. planner). By directly in Defra are you asking if I will use the sdk or other open telemetry metric stuff directly, or asking about the registering through the config to monitor other non-explain metrics?

Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)

done

or asking about the registering through the config to monitor other non-explain metrics

This one :)

done

Thanks :)

Open telemetry can be used in a localized way, and a global way. For example for non-explain metrics, say for example we wanted to monitor/observe in a periodic way some memory stats and wanted to have them keep being monitored for as long as the server is up, then it makes sense to register the global meter that keeps observing and in an async fashion keeps exporting the metrics data to the console or to a Prometheus like tool.

This above approach is somewhat independent of what I am going to do next, but if that is a priority then it can be worked on and this package can be further expanded. Should not be too bad once the basic telemetry stuff is in (with the small hiccup of not having much documentation atm haha).

For explain stuff as we only want to localized metrics that are collected once the request is made, we don't need to have a global metric meter and can just build the metric data manually per request using what is in the package currently.

global metric meter

Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).

I wasn't sure if we we doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit as the explain stuff is fairly isolated IMO.

Thanks for the explanation :)

Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).

I agree, was one of the reason I wanted to try this approach as well. Yes as you mentioned that is what the plan is when I say "localized" register at the config level and then as you said pass the object down in a granular level and then aggregate it externally (only difference for non-explain metrics will be that the aggregation would be periodic and being observed rather than being returned once after a request is made).

I wasn't sure if we we doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit as the explain stuff is fairly isolated IMO.

Yes exactly, glad we are on the same page.

Thanks for the explanation :)

Anytime, actually helped me visualize things better too.

fredcarle

Just one suggestion as Andy beat me to the other ones :)

metric/metric.go

AndrewSisley · 2022-12-16T16:13:22Z

Any strong opinions on the location of the metric package?

Nothing strong, but I have a minor preference for metrics over metric

shahzadlone · 2022-12-16T16:23:56Z

It looks good to me, but I am only guessing as to how it would actually be used in Defra - it would be very handy if you could hook it up to some production code in this PR so that I/we can properly see this.

Sorry about that, hopefully, the execution explain PR will make more sense as it makes use of this. I don't have the groundwork needed in this PR that needs to be done before I can hook it up to the exec explain stuff. But did the tests make sense? LMK if you want me to make something clearer in the tests.

AndrewSisley · 2022-12-16T16:25:31Z

Sorry about that, hopefully, the execution explain PR will make more sense as it makes use of this. I don't have the groundwork needed in this PR that needs to be done before I can hook it up to the exec explain stuff. But did the tests make sense? LMK if you want me to make something clearer in the tests.

No that's all good if Execution-Explain is the next planned step for this :) Thanks - approving

shahzadlone · 2022-12-16T16:26:54Z

Any strong opinions on the location of the metric package?

Nothing strong, but I have a minor preference for metrics over metric

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

AndrewSisley · 2022-12-16T16:28:23Z

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

😁 [glances nervously at errors, events, tests]

fredcarle · 2022-12-16T22:37:26Z

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

😁 [glances nervously at errors, events, tests]

errors has an s as to not conflict with the native error type. events and tests` should probably be singular.

fredcarle

LGTM

explain.

* PR: Add the metric functionality that would be needed for execution explain. * PR: Add tests for the metric functionality. * PR: Pull in opentelemetry (update deps). * PR: Cleanup and Fixup.

shahzadlone force-pushed the lone/feature/metrics branch from 7a46c0b to ba938f4 Compare December 16, 2022 15:13

shahzadlone requested a review from a team December 16, 2022 15:16

shahzadlone self-assigned this Dec 16, 2022

shahzadlone added feature New feature or request action/no-benchmark Skips the action that runs the benchmark. labels Dec 16, 2022

shahzadlone added this to the DefraDB v0.4 milestone Dec 16, 2022

This comment was marked as off-topic.

Sign in to view

AndrewSisley reviewed Dec 16, 2022

View reviewed changes

fredcarle reviewed Dec 16, 2022

View reviewed changes

metric/metric.go Outdated Show resolved Hide resolved

AndrewSisley approved these changes Dec 16, 2022

View reviewed changes

shahzadlone force-pushed the lone/feature/metrics branch from ba938f4 to 429a81d Compare December 16, 2022 16:34

shahzadlone mentioned this pull request Dec 16, 2022

Explain Request - Execution #326

Closed

shahzadlone force-pushed the lone/feature/metrics branch from 429a81d to 5002141 Compare December 16, 2022 20:25

fredcarle approved these changes Dec 16, 2022

View reviewed changes

shahzadlone added 4 commits December 17, 2022 15:39

PR: Add the metric functionality that would be needed for execution

a8a9608

explain.

PR: Add tests for the metric functionality.

7339310

PR: Pull in opentelemetry (update deps).

fd694ae

PR: Cleanup and Fixup.

35a9a5e

shahzadlone force-pushed the lone/feature/metrics branch from 5002141 to 35a9a5e Compare December 17, 2022 20:40

shahzadlone merged commit 2e5626f into develop Dec 17, 2022

shahzadlone deleted the lone/feature/metrics branch December 17, 2022 21:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add basic metric functionality #971

feat: Add basic metric functionality #971

shahzadlone commented Dec 16, 2022 •

edited

Loading

codecov bot commented Dec 16, 2022 •

edited

Loading

This comment was marked as off-topic.

AndrewSisley left a comment

AndrewSisley Dec 16, 2022

shahzadlone Dec 16, 2022

AndrewSisley Dec 16, 2022

shahzadlone Dec 16, 2022 •

edited

Loading

AndrewSisley Dec 16, 2022

shahzadlone Dec 16, 2022

AndrewSisley Dec 16, 2022 •

edited

Loading

shahzadlone Dec 16, 2022

fredcarle left a comment

AndrewSisley commented Dec 16, 2022

shahzadlone commented Dec 16, 2022

AndrewSisley commented Dec 16, 2022

shahzadlone commented Dec 16, 2022

AndrewSisley commented Dec 16, 2022 •

edited

Loading

fredcarle commented Dec 16, 2022

fredcarle left a comment

feat: Add basic metric functionality #971

feat: Add basic metric functionality #971

Conversation

shahzadlone commented Dec 16, 2022 • edited Loading

Relevant issue(s)

Description

Tasks

How has this been tested?

codecov bot commented Dec 16, 2022 • edited Loading

Codecov Report

This comment was marked as off-topic.

AndrewSisley left a comment

Choose a reason for hiding this comment

AndrewSisley Dec 16, 2022

Choose a reason for hiding this comment

shahzadlone Dec 16, 2022

Choose a reason for hiding this comment

AndrewSisley Dec 16, 2022

Choose a reason for hiding this comment

shahzadlone Dec 16, 2022 • edited Loading

Choose a reason for hiding this comment

AndrewSisley Dec 16, 2022

Choose a reason for hiding this comment

shahzadlone Dec 16, 2022

Choose a reason for hiding this comment

AndrewSisley Dec 16, 2022 • edited Loading

Choose a reason for hiding this comment

shahzadlone Dec 16, 2022

Choose a reason for hiding this comment

fredcarle left a comment

Choose a reason for hiding this comment

AndrewSisley commented Dec 16, 2022

shahzadlone commented Dec 16, 2022

AndrewSisley commented Dec 16, 2022

shahzadlone commented Dec 16, 2022

AndrewSisley commented Dec 16, 2022 • edited Loading

fredcarle commented Dec 16, 2022

fredcarle left a comment

Choose a reason for hiding this comment

shahzadlone commented Dec 16, 2022 •

edited

Loading

codecov bot commented Dec 16, 2022 •

edited

Loading

shahzadlone Dec 16, 2022 •

edited

Loading

AndrewSisley Dec 16, 2022 •

edited

Loading

AndrewSisley commented Dec 16, 2022 •

edited

Loading