Observing: traces, logs and metrics #560
Reply: It also appears that services like Datadog have built-in capability for converting structured logs into metrics.
We need to make our node (and other binaries) observable. Let's discuss our options.
Related issues: #144 and miden-base/1004
Overview
Producer crates
I believe the two main candidates here are `tracing` and `metrics` (or some variation thereof). Both provide the emitter side, and there are many consumer crates to ingest/transform the output.
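For context, a rough sketch of what emitting looks like with each crate (illustrative only; in particular, the `metrics` macro syntax differs between crate versions):

```rust
use tracing::info;

fn apply_block(height: u64, tx_count: usize) {
    // tracing: structured events (and spans) with arbitrary key/value fields.
    info!(block.height = height, block.tx_count = tx_count, "applied block");

    // metrics: purpose-built counters/gauges/histograms.
    // NOTE: this is the older `counter!(name, value)` form; newer versions of
    // the `metrics` crate use `counter!(name).increment(value)` instead.
    metrics::counter!("blocks_applied", 1);
}
```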
Wide events
Personally I'm a fan of wide events. As a summary: instead of separate logs, traces and metrics, just use structured events and let the consumer decide how to interpret things.
I believe `tracing` (despite the name) essentially already covers this, i.e. I'm suggesting we use only `tracing`, including for metrics. This does have a downside: wide events are much larger than pure metrics. What I'm hoping is that tracing/metrics will be as simple as `#[tracing::instrument]` in the appropriate places, which is then consumed to generate logs, metrics and traces.

The major benefit to me is that we decouple the production of events from their interpretation, i.e. don't worry about metric vs log info, just instrument the main methods and let the consumers present the data as appropriate. Of course we should still keep the data size in mind, i.e. don't emit the entire block's transaction data - probably just transaction IDs are good enough.
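As a rough illustration of the wide-event approach (the type and field names here are made up for the example, not our actual APIs): one annotation produces a span with structured fields, and it is up to the consumer whether that becomes a log line, a trace span or a metric.

```rust
use tracing::{info, instrument};

// Placeholder type purely for the example.
struct Transaction {
    id: u64,
    note_count: usize,
}

// The attribute creates a span covering the whole call, with the transaction
// id attached as a structured field. Consumers can render this as a log line,
// a trace span in Jaeger, or aggregate it into metrics.
#[instrument(skip(tx), fields(tx.id = tx.id))]
fn apply_transaction(tx: &Transaction) {
    // ... do the work ...
    info!(num_notes = tx.note_count, "transaction applied");
}
```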
Consumers
I'm aware of Grafana for metrics, and Jaeger for traces. Here is a semi-recent blog post with setup and examples.

Similarly, we need to decide on using a more general protocol like OpenTelemetry, or specific ones, e.g. Prometheus for metrics. Personally OpenTelemetry has always been confusing af as a standard to me, but I'm not aware of a better option.
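If we do go the OpenTelemetry route, wiring `tracing` into it is roughly a matter of adding a `tracing-opentelemetry` layer that exports via OTLP to a collector (Jaeger can ingest OTLP). A rough sketch only; the `opentelemetry`/`opentelemetry-otlp` APIs have shifted between releases, so treat this as the shape rather than copy-paste code:

```rust
use tracing_subscriber::prelude::*; // for .with() / .init() on the registry

fn init_telemetry() -> Result<(), Box<dyn std::error::Error>> {
    // Build an OTLP exporter + tracer (gRPC to a local collector by default).
    // NOTE: this follows the older opentelemetry-otlp pipeline API; newer
    // releases have renamed/moved parts of this.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)?;

    tracing_subscriber::registry()
        // Human-readable logs to stdout.
        .with(tracing_subscriber::fmt::layer())
        // The same spans/events also exported as OpenTelemetry traces.
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    Ok(())
}
```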
We may also want to reach out to devops to understand what's in use within Polygon.
Node specifics
#144 had some outlines on what should be tracked, though I believe that was looking too far ahead in terms of what we have available to us now.
Trace `target=`
Currently we are overriding the default `target` by doing `target = COMPONENT` in all trace events. This is quite error prone as it's easy to forget, and in some cases we also use `target = miden-store` hardcoded instead.

There is probably some way to avoid this, using custom layers or by manipulating the log subscriber instead.
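One option (a sketch, not verified against our setup; the crate/module names are illustrative): rely on `tracing`'s default target, which is the emitting module path, and do any per-component filtering in the subscriber instead of at every call site, e.g. with an `EnvFilter` directive per crate.

```rust
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // Without an explicit `target = ...` at each call site, events default to
    // their module path (e.g. "miden_node_store::db"). Per-component routing
    // and verbosity can then be configured centrally here.
    let filter = EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| EnvFilter::new("info,miden_node_store=debug"));

    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```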
Metrics
We definitely want metrics for the various RPC interactions. This should be easy using `#[instrument]`, and should include the timings (see the sketch below). In general we want to instrument/time at minimum:
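On the timing side, a sketch of one consumer-side option (not necessarily what we'd ship): `tracing-subscriber`'s fmt layer can emit an event whenever an instrumented span closes, including busy/idle time, which gives basic per-request timings without any extra code at the call sites.

```rust
use tracing_subscriber::fmt::format::FmtSpan;

fn init_logging() {
    // Emit an event whenever a span closes; for spans created by #[instrument]
    // this includes time.busy/time.idle, i.e. the request duration, with no
    // additional code in the handlers themselves.
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE)
        .init();
}
```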
Traces
What would be amazing is to visualize the lifetime of RPC requests, transactions, notes, batches, blocks etc. As an example, for a specific transaction it would be great to query Jaeger by transaction ID and have it plot a timeline.

This should be possible if each of these events includes the transaction ID(s), which should be somewhat trivial to instrument. I'm not familiar enough with Jaeger and trace consumers to really know how cheap/easy this is to do. But I do believe this could be great for debugging live systems; though maybe it's overkill and logs are good enough.

Something to consider is assigning a UUID to each RPC request so that we can correlate internal RPC calls, e.g. user request -> rpc -> store.
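A sketch of how that correlation could look with plain `tracing` (the span/field names and the `uuid` dependency are just for illustration): generate the ID at the RPC boundary, record it as a span field, and every event or child span inside that request then carries it, so a trace consumer can group or search by it.

```rust
use tracing::{info, info_span, Instrument};
use uuid::Uuid;

async fn handle_submit_tx(tx_id: &str) {
    // One ID per incoming RPC request; recorded as a span field so every
    // event and child span inside the request carries it.
    let request_id = Uuid::new_v4();
    let span = info_span!("rpc.submit_tx", request.id = %request_id, tx.id = %tx_id);

    async {
        info!("forwarding to store");
        // store_client.apply_tx(...).await; // downstream calls inherit the span
    }
    .instrument(span)
    .await;
}
```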