feat: Add basic metric functionality #971
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #971 +/- ##
===========================================
- Coverage 57.76% 57.73% -0.03%
===========================================
Files 170 171 +1
Lines 19440 19478 +38
===========================================
+ Hits 11229 11246 +17
- Misses 7223 7239 +16
- Partials 988 993 +5
It looks good to me, but I am only guessing as to how it would actually be used in Defra - it would be very handy if you could hook it up to some production code in this PR so that I/we can properly see this.
func (m *Meter) GetSyncHistogram(
	name string,
	unit unit.Unit,
) (syncint64.Histogram, error) {
thought: Reading this PR it looks like syncint64.Histogram is the main type that will be passed around throughout Defra - e.g. held by the planner.countNode object? If so it looks like it would be worth introducing an interface for it, so we don't have hundreds of references to syncint64.Histogram throughout our codebase.
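For illustration, a minimal sketch of the kind of Defra-owned interface suggested here. The DurationRecorder name and its method set are hypothetical (not part of this PR), and the syncint64 import path is assumed from the OpenTelemetry version this PR appears to target; it has moved between releases.

```go
package metric

import (
	"context"

	"go.opentelemetry.io/otel/metric/instrument/syncint64"
)

// DurationRecorder is a hypothetical Defra-owned interface so that callers such
// as planner.countNode depend on it rather than on syncint64.Histogram directly.
type DurationRecorder interface {
	// Record adds a single observation (e.g. elapsed microseconds).
	Record(ctx context.Context, value int64)
}

// histogram adapts the OpenTelemetry instrument to the interface.
type histogram struct {
	inner syncint64.Histogram
}

func (h histogram) Record(ctx context.Context, value int64) {
	h.inner.Record(ctx, value)
}
```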
I thought about this point too. In the next work (using this in execution explain) I was thinking about narrowing down all the required operations and then just providing those methods, such that the syncint64.xyz resource is held by the Meter or Metric abstraction and operated on there instead of being returned, which would avoid this pollution (so Meter would be held by the planner.countNode, for example). Please do call me out if the execution PR misses this.
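As a hedged sketch of that direction (ExecutionMetric and RecordExecutionTime are hypothetical names, not what the PR ships): the instrument stays private to the metric package and only the needed operations are exposed, so planner nodes hold the wrapper rather than syncint64 types.

```go
package metric

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/metric/instrument/syncint64"
)

// ExecutionMetric keeps the syncint64 instrument unexported and exposes only
// the operations the planner needs (the name and method set are hypothetical).
type ExecutionMetric struct {
	execTime syncint64.Histogram
}

// RecordExecutionTime records how long an operation took, in microseconds, so
// that e.g. planner.countNode can hold an *ExecutionMetric (or the Meter that
// owns it) without referencing syncint64.Histogram anywhere.
func (m *ExecutionMetric) RecordExecutionTime(ctx context.Context, d time.Duration) {
	m.execTime.Record(ctx, d.Microseconds())
}
```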
in the next work (using this in execution explain)
I held off asking about this - your plan is to use it in that next? Before hooking up metrics for metrics' sake/directly in Defra?
everything else :)
Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)
your plan is to use it in that next? Before hooking up metrics for metrics sake/directly in Defra?
Yes, the plan is to use it in execution explain next (for which it would need to be used by Defra, i.e. the planner). By "directly in Defra" are you asking if I will use the SDK or other OpenTelemetry metric stuff directly, or asking about registering through the config to monitor other non-explain metrics?
Would suggest copying a link to this discussion into the Execution ticket in that case, or we both might forget :)
done
or asking about registering through the config to monitor other non-explain metrics
This one :)
done
Thanks :)
Open telemetry can be used in a localized way and in a global way. For non-explain metrics, say we wanted to periodically monitor/observe some memory stats and keep monitoring them for as long as the server is up; then it makes sense to register a global meter that keeps observing and asynchronously keeps exporting the metrics data to the console or to a Prometheus-like tool.
This approach is somewhat independent of what I am going to do next, but if it is a priority then it can be worked on and this package can be further expanded. It should not be too bad once the basic telemetry stuff is in (with the small hiccup of not having much documentation atm haha).
For the explain stuff, as we only want localized metrics that are collected once the request is made, we don't need a global metric meter and can just build the metric data manually per request using what is in the package currently.
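To make the two modes concrete, an illustrative-only sketch follows; the Snapshot type, Exporter interface, ObserveGlobally and CollectPerRequest functions are hypothetical and not part of the metric package. The global mode observes periodically for the lifetime of the server and pushes data to a sink, while the localized explain mode builds the data once per request and returns it.

```go
package metric

import (
	"context"
	"time"
)

// Snapshot stands in for whatever aggregated metric data is produced.
type Snapshot struct {
	Name  string
	Value int64
}

// Exporter stands in for a console or Prometheus-style sink.
type Exporter interface {
	Export(ctx context.Context, data []Snapshot) error
}

// ObserveGlobally models the global mode: keep collecting and asynchronously
// exporting for as long as the server (i.e. the context) is alive.
func ObserveGlobally(ctx context.Context, collect func() []Snapshot, e Exporter, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = e.Export(ctx, collect())
		}
	}
}

// CollectPerRequest models the localized explain mode: build the metric data
// manually for a single request and hand it back with the response.
func CollectPerRequest(collect func() []Snapshot) []Snapshot {
	return collect()
}
```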
global metric meter
Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).
I wasn't sure if we were doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit, as the explain stuff is fairly isolated IMO.
Thanks for the explanation :)
Slightly off-topic now, but this doesn't have to be global, and I'd suggest for lots of stuff that is a poor way of doing it - you can gain lots of very useful information by collecting the metrics on a more granular level and then aggregating it externally - e.g. collect time spent in countNode on a per request basis, and then look at the average, count and variance of that dataset externally (to perhaps identify a common/high-prio performance bottleneck).
I agree; that was one of the reasons I wanted to try this approach as well. Yes, as you mentioned, that is the plan when I say "localized": register at the config level and then, as you said, pass the object down at a granular level and aggregate it externally (the only difference for non-explain metrics is that the aggregation would be periodic and observed, rather than being returned once after a request is made).
I wasn't sure if we were doing this first/next or the explain stuff - if doing the explain stuff next it is easier to kick questions RE interface design down the road a bit, as the explain stuff is fairly isolated IMO.
Yes exactly, glad we are on the same page.
Thanks for the explanation :)
Anytime, actually helped me visualize things better too.
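A small hedged example of the per-request, granular collection discussed above (the recorder interface, timeStep helper, and the microsecond unit are illustrative choices, not the PR's API): each request records its own elapsed time, and average, count, and variance are computed externally from the exported data.

```go
package planner

import (
	"context"
	"time"
)

// recorder is whichever abstraction ends up wrapping the histogram; only a
// Record operation is assumed here.
type recorder interface {
	Record(ctx context.Context, value int64)
}

// timeStep times a single plan-node step on a per-request basis and records
// the elapsed microseconds, leaving aggregation to the external consumer.
func timeStep(ctx context.Context, r recorder, step func() error) error {
	start := time.Now()
	err := step()
	r.Record(ctx, time.Since(start).Microseconds())
	return err
}
```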
Just one suggestion as Andy beat me to the other ones :)
Nothing strong, but I have a minor preference for
Sorry about that; hopefully the execution explain PR will make more sense as it makes use of this. This PR is the groundwork that needs to be done before I can hook it up to the exec explain stuff. But did the tests make sense? LMK if you want me to make something clearer in the tests.
No, that's all good if Execution-Explain is the next planned step for this :) Thanks - approving
Haha, that's what I had before, but then I remembered in Go it's considered a sin to have a plural word as a package name haha.
😁 [glances nervously at errors, events, tests]
LGTM
* PR: Add the metric functionality that would be needed for execution explain.
* PR: Add tests for the metric functionality.
* PR: Pull in opentelemetry (update deps).
* PR: Cleanup and Fixup.
Relevant issue(s)
Resolves #970
Related Epic: #293
Description
This PR makes use of open telemetry to set up manual instrumentation features for DefraDB.
Note:
- The Metric interface is there to wrap up common Metric functionality that would be easy to replace in the future (I am not sure if it's worth maintaining this abstraction in the future, as some functionality can not be abstracted and will fundamentally be different across different metric implementations).
- Aside from metricdata, no other SDK functionality is leaked outside the metric package.
- MeterProvider functionality might be good to look into for when we do observability of non-execution request related things.

Future work:
- Exporting the metric data to external tools (Prometheus, Jaeger etc.)

Question:
- … the metric package?

Tasks
How has this been tested?
Unit test and CI.
Specify the platform(s) on which this was tested: