Observing: traces, logs and metrics #560
Reply: It also appears that services like Datadog have built-in capability for converting structured logs into metrics.
We need to make our node (and other binaries) observable. Let's discuss our options.
Related issues: #144 and miden-base/1004
Overview
Producer crates
I believe the two main candidates here are `tracing` and `metrics` (or some variation thereof). Both provide the emitter side, and there are many consumer crates to ingest/transform the output.
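For context, a rough sketch of what emitting looks like with each crate (illustrative only; in particular, the `metrics` macro syntax differs between crate versions):

```rust
use tracing::info;

fn apply_block(height: u64, tx_count: usize) {
    // tracing: structured events (and spans) with arbitrary key/value fields.
    info!(block.height = height, block.tx_count = tx_count, "applied block");

    // metrics: purpose-built counters/gauges/histograms.
    // NOTE: this is the older `counter!(name, value)` form; newer versions of
    // the `metrics` crate use `counter!(name).increment(value)` instead.
    metrics::counter!("blocks_applied", 1);
}
```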
Wide events
Personally I'm a fan of wide events. As a summary: instead of separate logs, traces and metrics, just use structured events and let the consumer decide how to interpret things.
I believe `tracing` (despite the name) essentially already covers this, i.e. I'm suggesting we use only `tracing`, including for metrics. This does have a downside: wide events are much larger than pure metrics. What I'm hoping is that tracing/metrics will be as simple as `#[tracing::instrument]` in the appropriate places, which is then consumed to generate logs, metrics and traces.

The major benefit to me is that we decouple the production of events from their interpretation, i.e. don't worry about metric vs log info, just instrument the main methods and let the consumers present the data as appropriate. Of course we should still keep the data size in mind, i.e. don't emit the entire block's transaction data - probably just transaction IDs are good enough.
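As a rough illustration of the wide-event approach (the type and field names here are made up for the example, not our actual APIs): one annotation produces a span with structured fields, and it is up to the consumer whether that becomes a log line, a trace span or a metric.

```rust
use tracing::{info, instrument};

// Placeholder type purely for the example.
struct Transaction {
    id: u64,
    note_count: usize,
}

// The attribute creates a span covering the whole call, with the transaction
// id attached as a structured field. Consumers can render this as a log line,
// a trace span in Jaeger, or aggregate it into metrics.
#[instrument(skip(tx), fields(tx.id = tx.id))]
fn apply_transaction(tx: &Transaction) {
    // ... do the work ...
    info!(num_notes = tx.note_count, "transaction applied");
}
```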
Consumers
I'm aware of Grafana for metrics, and Jaeger for traces. Here is a semi-recent blog post with setup and examples.

Similarly, we need to decide on using a more general protocol like OpenTelemetry, or specific ones, e.g. Prometheus for metrics. Personally OpenTelemetry has always been confusing af as a standard to me, but I'm not aware of a better option.
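If we do go the OpenTelemetry route, wiring `tracing` into it is roughly a matter of adding a `tracing-opentelemetry` layer that exports via OTLP to a collector (Jaeger can ingest OTLP). A rough sketch only; the `opentelemetry`/`opentelemetry-otlp` APIs have shifted between releases, so treat this as the shape rather than copy-paste code:

```rust
use tracing_subscriber::prelude::*; // for .with() / .init() on the registry

fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // Build an OTLP exporter + tracer (gRPC to a local collector by default).
    // NOTE: this follows the older opentelemetry-otlp pipeline API; newer
    // releases have renamed/moved parts of this.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)?;

    tracing_subscriber::registry()
        // Human-readable logs to stdout.
        .with(tracing_subscriber::fmt::layer())
        // The same spans/events also exported as OpenTelemetry traces.
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    Ok(())
}
```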
We may also want to reach out to devops to understand what's in use within Polygon.
Node specifics
#144 had some outlines on what should be tracked, though I believe that was looking too far ahead in terms of what we have available to us now.
Trace `target=`
Currently we are overriding the default `target` by doing `target = COMPONENT` in all trace events. This is quite error prone as it's easy to forget, and in some cases we also use `target = miden-store` hardcoded instead.

There is probably some way to avoid this, using custom layers or by manipulating the log subscriber instead.
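One option (a sketch, not verified against our setup; the crate/module names are illustrative): rely on `tracing`'s default target, which is the emitting module path, and do any per-component filtering in the subscriber instead of at every call site, e.g. with an `EnvFilter` directive per crate.

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Without an explicit `target = ...` at each call site, events default to
    // their module path (e.g. "miden_node_store::db"). Per-component routing
    // and verbosity can then be configured centrally here.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,miden_node_store=debug"));

    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```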
Metrics
We definitely want metrics for the various RPC interactions. This should be easy using `#[instrument]`, and should include the timings (see the sketch below). In general we want to instrument/time at minimum:
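On the timing side, a sketch of one consumer-side option (not necessarily what we'd ship): `tracing-subscriber`'s fmt layer can emit an event whenever an instrumented span closes, including busy/idle time, which gives basic per-request timings without any extra code at the call sites.

```rust
use tracing_subscriber::fmt::format::FmtSpan;

fn init_logging() {
    // Emit an event whenever a span closes; for spans created by #[instrument]
    // this includes time.busy/time.idle, i.e. the request duration, with no
    // additional code in the handlers themselves.
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE)
        .init();
}
```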
Traces
What would be amazing is to visualize the lifetime of RPC requests, transactions, notes, batches, blocks etc. As an example, for a specific transaction it would be great to query Jaeger by transaction ID and have it plot a timeline.

This should be possible if each of these events includes the transaction ID(s), which should be somewhat trivial to instrument. I'm not familiar enough with Jaeger and trace consumers to really know how cheap/easy this is to do. But I do believe this could be great for debugging live systems; though maybe it's overkill and logs are good enough.

Something to consider is assigning a UUID to each RPC request so that we can correlate internal RPC calls, e.g. user request -> rpc -> store.
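A sketch of how that correlation could look with plain `tracing` (the span/field names and the `uuid` dependency are just for illustration): generate the ID at the RPC boundary, record it as a span field, and every event or child span inside that request then carries it, so a trace consumer can group or search by it.

```rust
use tracing::{info, info_span, Instrument};
use uuid::Uuid;

async fn handle_submit_tx(tx_id: &str) {
    // One ID per incoming RPC request; recorded as a span field so every
    // event and child span inside the request carries it.
    let request_id = Uuid::new_v4();
    let span = info_span!("rpc.submit_tx", request.id = %request_id, tx.id = %tx_id);

    async {
        info!("forwarding to store");
        // store_client.apply_tx(...).await; // downstream calls inherit the span
    }
    .instrument(span)
    .await;
}
```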