chore: register all metrics during startup#1144
Conversation
| metrics: &Metrics, | ||
| ) -> eyre::Result<u32> { | ||
| debug!("fetching latest nonce from sequencer"); | ||
| metrics::counter!(crate::metrics_init::NONCE_FETCH_COUNT).increment(1); |
There was a problem hiding this comment.
I moved this to inside tryhard::retry_fn on line 457, since it seems from the description ("The number of times we have attempted to fetch the nonce") like we want to count attempts rather than successes.
If we do mean success counts, I can revert this change and we could consider renaming the metric and description to reflect that?
There was a problem hiding this comment.
I think the original idea was not to count the individual GET requests but to count the "process" of getting a nonce.
Having said that, this metric feels a tad useless because it either closely tracks the number of failures/attempts (as it's doing now), or the number of data points added to the histogram.
I am actually leaning toward removing it altogether...
SuperFluffy
left a comment
There was a problem hiding this comment.
I like the approach this takes a lot. I have looked over the various methods calls setting the metrics and I can't find any obvious issues there.
The only bit that I would like to discuss prior to approving is the need to thread the &'static Metrics through the entire service. Since it's already a static (to deal with shared globals in the blackbox tests I suppose), we could avoid this entirely by making a global accessor akin to:
static METRICS: OnceLock<Metrics> = OnceLock::new();
pub(crate) get() -> &'static OnceLock {
METRICS.get_or_init(Metrics::new)
}This could avoid a lot of boilerplate.
| let Some(register_metrics) = register_metrics else { | ||
| return Err(Error::no_metric_register_func()); | ||
| }; | ||
| register_metrics(); |
There was a problem hiding this comment.
Comment on this and the high-level PR description: I overlooked this. In hindsight it's obvious - no sense in registering anything on a not-yet-initialized recorder.
This is also a good argument for having blackbox-level testing of metrics, which would have caught this.
| /// actors could not be spawned (executor, sequencer reader, or data availability reader). | ||
| /// This usually happens if the actors failed to connect to their respective endpoints. | ||
| pub fn new(cfg: Config) -> eyre::Result<Self> { | ||
| static METRICS: OnceLock<Metrics> = OnceLock::new(); |
There was a problem hiding this comment.
This comment applies to all instaces of Metrics:
Instead of initializing a static here and needing to pass it to every single caller (which on the one hand is nice, but on the other also very infectious), what about changing it to something like this:
// in module astria_conductor::metrics
static METRICS: OnceLock<Metrics> = OnceLock::new();
pub(crate) fn get() -> &'static Metrics {
METRICS.get_or_init(Metrics::new)
}
// at the call site
let metrics = crate::metrics::get();
todo!("do something with the metrics");There was a problem hiding this comment.
The intent here is to actively avoid creating a global static, since that's what ties our hands during testing.
I agree the approach in this PR is fairly intrusive, but by passing a static ref down to the callsites we can instantiate multiple tests concurrently and provide each test instance with its own exclusive Metrics struct, avoiding munging the metrics between tests.
As an aside, I considered just passing the Metrics struct everywhere by value, cloning as required. That's feasible with the implementation in this PR. The individual metrics are all just wrappers around an Arc, so relatively cheap to clone. But not as cheap as a static ref, so I opted for that (the Composer's Metrics is probably the most expensive to clone, holding five HashMaps of counters and another eight individual metrics).
* main: chore(bridge-withdrawer): add missing errors and clean up names (#1178) feat(sequencer): add ttl and invalid cache to app mempool (#1138) chore(astria-merkle): add benchmarks (#1179) chore(sequencer-relayer): add timeout to gRPCs to Celestia app (#1191) refactor(core): parse ics20 denoms as ibc or trace prefixed variants (#1181) Mycodecrafting/sequencer seed node (#1188) chore: register all metrics during startup (#1144) feat(charts): option to purge geth mempool (#1182)
Summary
Metrics in all five binaries are registered early in runtime flow.
Background
For a given binary, we want the metrics http endpoint to always yield data for every metric relevant to that binary, even if some of them have not been modified since startup. The
metrics::describe_macros (currently called during startup) don't register the provided metrics - instead individual metrics are only registered on the first execution of the relevantcounter!,gauge!orhistogram!macro.Further, the descriptions were never being applied, since in
astria-telemetrythe call to these macros was made before the metrics builder was installed.Changes
Similar changes were made to
astria-bridge-withdrawer,astria-composer,astria-conductor,astria-sequencerandastria-sequencer-relayer:metrics_initmodule renamed tometrics, and used to hold aMetricsstruct containing the relevant metrics objects.Metricsstruct is created early in the flow (but after the call to telemetry'sConfig::try_initwhere the metrics builder installs the global registry) and passed to whichever objects/functions require it. This approach will ultimately allow us to choose whether to use the global metrics registry or a local one, making the metrics much more testable in the future.Config'smetric_bucketssince it wasn't really useful (it applies the bucket limits to all histograms, regardless of type), andregister_metricssince it's also no longer used._latencyhistograms from ms to seconds, since they were actually recording seconds. This is a non-issue though, as the units are ignored in Prometheus.astria_conductor_sequencer_blocks_metadata_verified_per_celestia_fetchas this was missing.Testing
Manually tested that all metrics for a couple of the binaries are available in full as soon as the binary is started. Programmatic testing can be added in follow-up PRs.
Metrics
astria_composer_transactions_dropped_too_largefrom gauge to counter. We previously haddescribe_counter!to describe it and it was only ever being incremented by 1, but it was accessed viagauge!which seemed like a mistake, so I changed its type toCounter.astria_composer_transactions_receivedandastria_composer_transactions_droppedeach used to have two labels; one to ID the rollup and the other specifying collector type. For the geth collector, therollup_idlabel used the rollup's chain name, whereas for the grpc collector it used the rollup ID for therollup_idlabel. I changed these two metrics to always provide three labels;collector_type,rollup_idandrollup_chain_name.Related Issues
Closes #1027 and #1100.