Skip to content

chore: register all metrics during startup#1144

Merged
Fraser999 merged 2 commits intomainfrom
fraser/metrics-init
Jun 17, 2024
Merged

chore: register all metrics during startup#1144
Fraser999 merged 2 commits intomainfrom
fraser/metrics-init

Conversation

@Fraser999
Copy link
Contributor

Summary

Metrics in all five binaries are registered early in runtime flow.

Background

For a given binary, we want the metrics http endpoint to always yield data for every metric relevant to that binary, even if some of them have not been modified since startup. The metrics::describe_ macros (currently called during startup) don't register the provided metrics - instead individual metrics are only registered on the first execution of the relevant counter!, gauge! or histogram! macro.

Further, the descriptions were never being applied, since in astria-telemetry the call to these macros was made before the metrics builder was installed.

Changes

Similar changes were made to astria-bridge-withdrawer, astria-composer, astria-conductor, astria-sequencer and astria-sequencer-relayer:

  • metrics_init module renamed to metrics, and used to hold a Metrics struct containing the relevant metrics objects.
  • a static ref to a Metrics struct is created early in the flow (but after the call to telemetry's Config::try_init where the metrics builder installs the global registry) and passed to whichever objects/functions require it. This approach will ultimately allow us to choose whether to use the global metrics registry or a local one, making the metrics much more testable in the future.
  • removed telemetry Config's metric_buckets since it wasn't really useful (it applies the bucket limits to all histograms, regardless of type), and register_metrics since it's also no longer used.
  • changed the units of several _latency histograms from ms to seconds, since they were actually recording seconds. This is a non-issue though, as the units are ignored in Prometheus.
  • added a description for astria_conductor_sequencer_blocks_metadata_verified_per_celestia_fetch as this was missing.

Testing

Manually tested that all metrics for a couple of the binaries are available in full as soon as the binary is started. Programmatic testing can be added in follow-up PRs.

Metrics

  • Changed the type of astria_composer_transactions_dropped_too_large from gauge to counter. We previously had describe_counter! to describe it and it was only ever being incremented by 1, but it was accessed via gauge! which seemed like a mistake, so I changed its type to Counter.
  • astria_composer_transactions_received and astria_composer_transactions_dropped each used to have two labels; one to ID the rollup and the other specifying collector type. For the geth collector, the rollup_id label used the rollup's chain name, whereas for the grpc collector it used the rollup ID for the rollup_id label. I changed these two metrics to always provide three labels; collector_type, rollup_id and rollup_chain_name.

Related Issues

Closes #1027 and #1100.

@Fraser999 Fraser999 requested a review from a team as a code owner June 4, 2024 17:10
@Fraser999 Fraser999 requested a review from SuperFluffy June 4, 2024 17:10
@github-actions github-actions bot added conductor pertaining to the astria-conductor crate sequencer pertaining to the astria-sequencer crate sequencer-relayer pertaining to the astria-sequencer-relayer crate composer pertaining to composer labels Jun 4, 2024
metrics: &Metrics,
) -> eyre::Result<u32> {
debug!("fetching latest nonce from sequencer");
metrics::counter!(crate::metrics_init::NONCE_FETCH_COUNT).increment(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to inside tryhard::retry_fn on line 457, since it seems from the description ("The number of times we have attempted to fetch the nonce") like we want to count attempts rather than successes.

If we do mean success counts, I can revert this change and we could consider renaming the metric and description to reflect that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original idea was not to count the individual GET requests but to count the "process" of getting a nonce.

Having said that, this metric feels a tad useless because it either closely tracks the number of failures/attempts (as it's doing now), or the number of data points added to the histogram.

I am actually leaning toward removing it altogether...

Copy link
Contributor

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach this takes a lot. I have looked over the various methods calls setting the metrics and I can't find any obvious issues there.

The only bit that I would like to discuss prior to approving is the need to thread the &'static Metrics through the entire service. Since it's already a static (to deal with shared globals in the blackbox tests I suppose), we could avoid this entirely by making a global accessor akin to:

static METRICS: OnceLock<Metrics> = OnceLock::new();

pub(crate) get() -> &'static OnceLock {
    METRICS.get_or_init(Metrics::new)
}

This could avoid a lot of boilerplate.

let Some(register_metrics) = register_metrics else {
return Err(Error::no_metric_register_func());
};
register_metrics();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on this and the high-level PR description: I overlooked this. In hindsight it's obvious - no sense in registering anything on a not-yet-initialized recorder.

This is also a good argument for having blackbox-level testing of metrics, which would have caught this.

/// actors could not be spawned (executor, sequencer reader, or data availability reader).
/// This usually happens if the actors failed to connect to their respective endpoints.
pub fn new(cfg: Config) -> eyre::Result<Self> {
static METRICS: OnceLock<Metrics> = OnceLock::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment applies to all instaces of Metrics:

Instead of initializing a static here and needing to pass it to every single caller (which on the one hand is nice, but on the other also very infectious), what about changing it to something like this:

// in module astria_conductor::metrics
static METRICS: OnceLock<Metrics> = OnceLock::new();
pub(crate) fn get() -> &'static Metrics {
    METRICS.get_or_init(Metrics::new)
}

// at the call site
let metrics = crate::metrics::get();
todo!("do something with the metrics");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent here is to actively avoid creating a global static, since that's what ties our hands during testing.

I agree the approach in this PR is fairly intrusive, but by passing a static ref down to the callsites we can instantiate multiple tests concurrently and provide each test instance with its own exclusive Metrics struct, avoiding munging the metrics between tests.

As an aside, I considered just passing the Metrics struct everywhere by value, cloning as required. That's feasible with the implementation in this PR. The individual metrics are all just wrappers around an Arc, so relatively cheap to clone. But not as cheap as a static ref, so I opted for that (the Composer's Metrics is probably the most expensive to clone, holding five HashMaps of counters and another eight individual metrics).

@Fraser999 Fraser999 added this pull request to the merge queue Jun 17, 2024
Merged via the queue into main with commit 5f117cb Jun 17, 2024
@Fraser999 Fraser999 deleted the fraser/metrics-init branch June 17, 2024 15:09
steezeburger added a commit that referenced this pull request Jun 19, 2024
* main:
  chore(bridge-withdrawer): add missing errors and clean up names (#1178)
  feat(sequencer): add ttl and invalid cache to app mempool (#1138)
  chore(astria-merkle): add benchmarks (#1179)
  chore(sequencer-relayer): add timeout to gRPCs to Celestia app (#1191)
  refactor(core): parse ics20 denoms as ibc or trace prefixed variants (#1181)
  Mycodecrafting/sequencer seed node (#1188)
  chore: register all metrics during startup (#1144)
  feat(charts): option to purge geth mempool (#1182)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

composer pertaining to composer conductor pertaining to the astria-conductor crate sequencer pertaining to the astria-sequencer crate sequencer-relayer pertaining to the astria-sequencer-relayer crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

composer: Fix clippy::too_many_lines warning for the metrics_init::registry method

2 participants