Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish more prometheus stats throughout system #206

Closed
9 of 29 tasks
allada opened this issue Jul 19, 2023 · 5 comments
Closed
9 of 29 tasks

Publish more prometheus stats throughout system #206

allada opened this issue Jul 19, 2023 · 5 comments

Comments

@allada
Copy link
Member

allada commented Jul 19, 2023

Now that Prometheus is added and the API is established, we need to spread the usage around the system.

@chrisstaite
Copy link
Contributor

That's quite the project. I tried to grab some stats today, after enabling the endpoint I just got #EOF but it's probably because it was on the scheduler which doesn't have anything logging any metrics yet...

I did want to try and get some metrics off the scheduler based on the user, but Goma doesn't appear to pass the authenticated user through to the RBE. I need to look into that.

@allada
Copy link
Member Author

allada commented Jul 19, 2023

I'm about to send the behemoth PR to enable metrics for workers. I suggest waiting a little to see how that PR looks before starting to implement it.

As for enabling it for a specific user... this is going to be tricky, we'd have to pass down a lot of context to enable it for a specific strand. Right now we always gather metrics regardless of what user it is.

If this feature is really requested, here's a thought...

  • Disable metrics globally (would be trivial, since everything is currently wrapped in prometheus_utils.rs), set a global thread_local flag that prevents metric collection with env flag or disable it at runtime with a service endpoint.
  • Add a special endpoint that you can add specific IP addresses or endpoints to trigger it into a "debug" mode.
  • When users connect check to see if they are in the list of "debug" users, if they are create a new thread, setup a new tokio runtime for that thread manually and set the thread_local flag to capture metrics.

In theory we'd only capture metrics for debug endpoints in that case with near zero runtime cost.

@allada
Copy link
Member Author

allada commented Jul 20, 2023

I did want to try and get some metrics off the scheduler based on the user

I started some of the work here. To get it fully to what you want, we need a way to reset metrics (also should be trivial, since we already have a visitor), an endpoint to trigger the reset and to create a new thread and runtime for specific attached clients.
#215

@blakehatch
Copy link
Member

Taking a crack at "Total GRPC connections since server started":

  • Tracking how many clients have connected with a lazily initialized counter for thread safety.
  • Publishing just total number of client connections, not sure if other metrics are wanted like with Prometheus now publishes connected clients #230, don't want it to be bloated since it's never removing any clients unlike when only active connections are recorded

allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 25, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit to allada/nativelink-fork that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
allada added a commit that referenced this issue Jul 26, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: #1164, #650, #384, #209
towards: #206
zbirenbaum pushed a commit to zbirenbaum/nativelink that referenced this issue Jul 27, 2024
Metrics got an entire overhaul. Instead of relying on a broken
prometheus library to publish our metrics, we now use the
`tracing` library and with OpenTelemetry that we bind together
then publish into a prometheus library.

Metrics are now mostly derive-macros. This means that the struct
can express what it wants to export and a help text. The library
will choose if it is able to export it.

Tracing now works by calling `.publish()` on the parent structs,
those structs need to call `.publish()` on all the child members
it wishes to publish data about. If a "group" is requested, use
the `group!()` macro, which under-the-hood calls `tracing::span`
with some special labels. At primitive layers, it will call the
`publish!()` macro, which will call `tracing::event!()` macro
under-the-hood with some special fields set. A custom
`tracing::Subscriber` will intercept all the events and spans
and convert them into a json-like object. This object can then
be exported as real json or encoded into other formats like
otel/prometheus.

closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209
towards: TraceMachina#206
@allada
Copy link
Member Author

allada commented Sep 12, 2024

Most of these got implemented. Closing. Create new tickets as needed.

@allada allada closed this as completed Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants