Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add basic metric functionality #971

Merged
merged 4 commits into from
Dec 17, 2022
Merged

Conversation

shahzadlone
Copy link
Member

@shahzadlone shahzadlone commented Dec 16, 2022

Relevant issue(s)

Resolves #970
Related Epic: #293

Description

This PR makes use of open telemetry to setup manual instrumentation features for DefraDB.

Note:

  • The Metric interface is there to wrap up common Metric functionality that would be easy to replace in the future (I am not sure if it's worth maintaining this abstraction in the future as some functionality can not be abstracted and will fundamentally be different across different metric implementations).
  • With the exception of metricdata no other SDK functionality is leaked outside the metric package.
  • Async functionality is not wrapped yet (will add on as needed basis).
  • There is also a global MeterProvider functionality which might be good to look into for when we do observability of non-execution request related things.
  • The configuration option to explicitly turn metrics off and on is not done here as currently, it is somewhat dormant code wrt defra, and the immediate work for execution explain will always require metrics.

Future work:

  • While implementing this I tried to keep the implementation of the execution flow in mind, seems like the basic instruments of telemetry should suffice, however, because we will be restructuring metrics into a plangraph-like dag of children and parent metrics from different nodes, we might need to flatten (or restructure) it in the future depending on the type of exporter we use (Prometheus, Jaeger etc.)
  • UpDownCounter

Question:

  • Any strong opinions on the location of the metric package?

Tasks

  • I made sure the code is well commented, particularly hard-to-understand areas.
  • I made sure the repository-held documentation is changed accordingly.
  • I made sure the pull request title adheres to the conventional commit style (the subset used in the project can be found in tools/configs/chglog/config.yml).
  • I made sure to discuss its limitations such as threats to validity, vulnerability to mistake and misuse, robustness to invalidation of assumptions, resource requirements, ...

How has this been tested?

Unit test and CI.

Specify the platform(s) on which this was tested:

  • Manjaro WSL2

@codecov
Copy link

codecov bot commented Dec 16, 2022

Codecov Report

Merging #971 (35a9a5e) into develop (ff13743) will decrease coverage by 0.02%.
The diff coverage is 65.78%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #971      +/-   ##
===========================================
- Coverage    57.76%   57.73%   -0.03%     
===========================================
  Files          170      171       +1     
  Lines        19440    19478      +38     
===========================================
+ Hits         11229    11246      +17     
- Misses        7223     7239      +16     
- Partials       988      993       +5     
Impacted Files Coverage Δ
metric/metric.go 65.78% <65.78%> (ø)
net/client.go 78.04% <0.00%> (-4.88%) ⬇️
net/peer.go 44.01% <0.00%> (-2.16%) ⬇️
net/pb/net.pb.go 15.66% <0.00%> (+0.16%) ⬆️

@shahzadlone shahzadlone requested a review from a team December 16, 2022 15:16
@shahzadlone shahzadlone self-assigned this Dec 16, 2022
@shahzadlone shahzadlone added feature New feature or request action/no-benchmark Skips the action that runs the benchmark. labels Dec 16, 2022
@shahzadlone shahzadlone added this to the DefraDB v0.4 milestone Dec 16, 2022
@source-devs

This comment was marked as off-topic.

Copy link
Contributor

@AndrewSisley AndrewSisley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, but I am only guessing as to how it would actually be used in Defra - it would be very handy if you could hook it up to some production code in this PR so that I/we can properly see this.

metric/metric.go Show resolved Hide resolved
metric/metric.go Outdated Show resolved Hide resolved
metric/metric.go Outdated Show resolved Hide resolved
func (m *Meter) GetSyncHistogram(
name string,
unit unit.Unit,
) (syncint64.Histogram, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: Reading this PR it looks like syncint64.Histogram is the main type that will be passed around throughout Defra - e.g. held by the planner.countNode object? If so it looks like it would be worth introducing an interface for it, so we dont have hundreds of references to syncint64.Histogram throughout our codebase

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this point too, in the next work (using this in execution explain) I was thinking about narrowing down all the required operations and then just providing those methods such that syncint64.xyz resource held by the Meter or Metric abstraction and then it will be operated on instead of returning, which would avoid this pollution (so Meter would be held by the planner.countNode for example). Please do call me out if the execution PR misses this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the next work (using this in execution explain)

I held off asking about this - your plan is to use it in that next? Before hooking up metrics for metrics sake/directly in Defra?

everything else :)

Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)

Copy link
Member Author

@shahzadlone shahzadlone Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your plan is to use it in that next? Before hooking up metrics for metrics sake/directly in Defra?

Yes, plan is to use it in execution explain next (for which would need to be used by Defra, i.e. planner). By directly in Defra are you asking if I will use the sdk or other open telemetry metric stuff directly, or asking about the registering through the config to monitor other non-explain metrics?

Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or asking about the registering through the config to monitor other non-explain metrics

This one :)

done

Thanks :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open telemetry can be used in a localized way, and a global way. For example for non-explain metrics, say for example we wanted to monitor/observe in a periodic way some memory stats and wanted to have them keep being monitored for as long as the server is up, then it makes sense to register the global meter that keeps observing and in an async fashion keeps exporting the metrics data to the console or to a Prometheus like tool.

This above approach is somewhat independent of what I am going to do next, but if that is a priority then it can be worked on and this package can be further expanded. Should not be too bad once the basic telemetry stuff is in (with the small hiccup of not having much documentation atm haha).

For explain stuff as we only want to localized metrics that are collected once the request is made, we don't need to have a global metric meter and can just build the metric data manually per request using what is in the package currently.

Copy link
Contributor

@AndrewSisley AndrewSisley Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global metric meter

Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).

I wasn't sure if we we doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit as the explain stuff is fairly isolated IMO.

Thanks for the explanation :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).

I agree, was one of the reason I wanted to try this approach as well. Yes as you mentioned that is what the plan is when I say "localized" register at the config level and then as you said pass the object down in a granular level and then aggregate it externally (only difference for non-explain metrics will be that the aggregation would be periodic and being observed rather than being returned once after a request is made).

I wasn't sure if we we doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit as the explain stuff is fairly isolated IMO.

Yes exactly, glad we are on the same page.

Thanks for the explanation :)

Anytime, actually helped me visualize things better too.

Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one suggestion as Andy beat me to the other ones :)

metric/metric.go Outdated Show resolved Hide resolved
@AndrewSisley
Copy link
Contributor

Any strong opinions on the location of the metric package?

Nothing strong, but I have a minor preference for metrics over metric

@shahzadlone
Copy link
Member Author

It looks good to me, but I am only guessing as to how it would actually be used in Defra - it would be very handy if you could hook it up to some production code in this PR so that I/we can properly see this.

Sorry about that, hopefully, the execution explain PR will make more sense as it makes use of this. I don't have the groundwork needed in this PR that needs to be done before I can hook it up to the exec explain stuff. But did the tests make sense? LMK if you want me to make something clearer in the tests.

@AndrewSisley
Copy link
Contributor

Sorry about that, hopefully, the execution explain PR will make more sense as it makes use of this. I don't have the groundwork needed in this PR that needs to be done before I can hook it up to the exec explain stuff. But did the tests make sense? LMK if you want me to make something clearer in the tests.

No that's all good if Execution-Explain is the next planned step for this :) Thanks - approving

@shahzadlone
Copy link
Member Author

Any strong opinions on the location of the metric package?

Nothing strong, but I have a minor preference for metrics over metric

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

@AndrewSisley
Copy link
Contributor

AndrewSisley commented Dec 16, 2022

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

😁 [glances nervously at errors, events, tests]

@fredcarle
Copy link
Collaborator

Haha that's what I had before but then I remembered in Go it's considered a sin to have a plural word as a package name haha.

😁 [glances nervously at errors, events, tests]

errors has an s as to not conflict with the native error type. events and tests` should probably be singular.

Copy link
Collaborator

@fredcarle fredcarle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shahzadlone shahzadlone merged commit 2e5626f into develop Dec 17, 2022
@shahzadlone shahzadlone deleted the lone/feature/metrics branch December 17, 2022 21:03
shahzadlone added a commit to shahzadlone/defradb that referenced this pull request Feb 23, 2024
* PR: Add the metric functionality that would be needed for execution
explain.

* PR: Add tests for the metric functionality.

* PR: Pull in opentelemetry (update deps).

* PR: Cleanup and Fixup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/no-benchmark Skips the action that runs the benchmark. feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add basic metrics using telemetry
4 participants