chore(metrics): restrict metrics crate usage to astria-telemetry #1192

Merged
Fraser999 merged 10 commits into main from fraser/wrap-metrics
Aug 30, 2024

Conversation

@Fraser999 Fraser999 commented Jun 18, 2024

Summary

This PR introduces several types in astria-telemetry to support building and recording metrics in our other crates.

Background

There are a few reasons driving the design here:

  • I wanted to remove the third-party metrics crate as a direct dependency from all our crates except astria-telemetry so that its macros aren't available in our other crates. This precludes accidental addition of a new metric in the future which isn't properly initialized at app startup.
  • There's no possibility of a description for a given metric being ignored due to an accidental mismatch in the metrics macros being used (e.g. we previously had a case of describe_gauge! being used with a counter! meaning the description was ignored)
  • We can set buckets on a per-histogram basis
  • We can set up metrics in our black box tests without requiring an actual http server to be running. Configuring the metrics crate currently doesn't support either passing in an already-bound listener (where we could bind to port 0 and find out what actual port was used before giving the listener to metrics) or querying the running server in metrics to find out the port. This makes reliable testing using an actual http server essentially impossible. With the changes in this PR, the black box tests will retain a handle to the underlying exporter, meaning snapshots can be rendered which yield the same view of the metrics as if the http endpoint had been called.
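
For reference, the "bind to port 0" technique mentioned in the last point looks roughly like this with the standard library alone. This is only a sketch of the general approach, not something the `metrics` crate accepts today, and `bind_ephemeral` is a hypothetical helper name:

```rust
use std::net::{SocketAddr, TcpListener};

/// Binds to an OS-assigned port on localhost and reports which address was chosen.
/// (Hypothetical helper, for illustration only.)
pub fn bind_ephemeral() -> std::io::Result<(TcpListener, SocketAddr)> {
    // Port 0 asks the OS to pick any currently free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    // The actual port is only known after binding.
    let addr = listener.local_addr()?;
    Ok((listener, addr))
}
```

A test would then need to hand the already-bound `listener` to the HTTP server, which is the step the exporter configuration does not currently allow.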

Changes

  • Added newtypes for Counter, Gauge and Histogram.
  • Added a Metrics trait, implemented by each of the crates' Metrics structs to support registering the metrics via telemetry::Config::try_init.
  • Added metrics::ConfigBuilder to support configuring and initializing metrics outside of telemetry::Config::try_init (for use in black box tests).
  • Added metrics::RegisteringBuilder to support registering individual metrics via the Metrics trait. To register a metric, this builder returns an appropriate factory, and the factory is then used potentially multiple times to register and return the individual metric with or without groups of labels applied.
  • Added metrics::BucketBuilder to support registering histogram buckets via the Metrics trait. This is unfortunately a separate builder since the metrics crate doesn't support registering buckets at the same point as registering the histograms. The former must happen via the PrometheusBuilder, but the latter can only be done once the PrometheusBuilder has been consumed and converted to a PrometheusRecorder. I added checks to ensure that buckets set via the BucketBuilder are all related to actual histograms registered later.
  • Changed the telemetry::Config setters to use a consistent set_ naming convention and merged the two metrics-related setters into a single one, since now we require either both or neither.
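
To make the shape of these builders a bit more concrete, here is a heavily simplified, standard-library-only sketch. It is not the code in this PR: the method names, the `Error` variants and the `finish` step are assumptions for illustration, and the factory/labels layer is omitted. It shows two ideas from the list above: the description is passed to a kind-specific registration method (so it cannot be attached to the wrong kind of metric), and buckets set up front are checked against the histograms registered later.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Newtype handles. A real implementation would wrap the `metrics` crate's handles.
#[derive(Debug, Clone, PartialEq)]
pub struct Counter {
    name: &'static str,
}

impl Counter {
    pub fn name(&self) -> &'static str {
        self.name
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct Histogram {
    name: &'static str,
}

impl Histogram {
    pub fn name(&self) -> &'static str {
        self.name
    }
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Error {
    DuplicateMetric(&'static str),
    BucketsForUnknownHistogram(&'static str),
}

/// Collects per-histogram buckets before any metric is registered.
#[derive(Default)]
pub struct BucketBuilder {
    buckets: BTreeMap<&'static str, Vec<f64>>,
}

impl BucketBuilder {
    pub fn set_buckets(&mut self, histogram_name: &'static str, bounds: &[f64]) {
        self.buckets.insert(histogram_name, bounds.to_vec());
    }
}

/// Registers metrics; the description always travels with the metric kind.
pub struct RegisteringBuilder {
    buckets: BTreeMap<&'static str, Vec<f64>>,
    descriptions: BTreeMap<&'static str, &'static str>,
    histograms: BTreeSet<&'static str>,
}

impl RegisteringBuilder {
    pub fn new(bucket_builder: BucketBuilder) -> Self {
        Self {
            buckets: bucket_builder.buckets,
            descriptions: BTreeMap::new(),
            histograms: BTreeSet::new(),
        }
    }

    /// Records the description, rejecting a name that was already registered.
    fn describe(&mut self, name: &'static str, description: &'static str) -> Result<(), Error> {
        if self.descriptions.contains_key(&name) {
            return Err(Error::DuplicateMetric(name));
        }
        self.descriptions.insert(name, description);
        Ok(())
    }

    pub fn register_counter(
        &mut self,
        name: &'static str,
        description: &'static str,
    ) -> Result<Counter, Error> {
        self.describe(name, description)?;
        Ok(Counter { name })
    }

    pub fn register_histogram(
        &mut self,
        name: &'static str,
        description: &'static str,
    ) -> Result<Histogram, Error> {
        self.describe(name, description)?;
        self.histograms.insert(name);
        Ok(Histogram { name })
    }

    /// Fails if buckets were set for a name that never became a histogram.
    pub fn finish(self) -> Result<(), Error> {
        for name in self.buckets.keys().copied() {
            if !self.histograms.contains(&name) {
                return Err(Error::BucketsForUnknownHistogram(name));
            }
        }
        Ok(())
    }
}
```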

Note that these changes don't prevent use of the raw metrics crate macros. So while we want to discourage their usage in our own crates, dependencies using the macros will continue to function as before, assuming we have configured metrics to use the global recorder. telemetry::Config::try_init enables the global recorder, whereas we disable it in our black box tests so they don't conflict.

Testing

Manually checked metrics are still available right after startup, and manually checked the handle provided to black box tests functions as expected. However, proper testing of metrics should now be possible and I'd like to tackle that in follow-up PRs.

Related Issues

Closes #1147.

@Fraser999 Fraser999 requested a review from a team as a code owner June 18, 2024 12:42
@Fraser999 Fraser999 requested a review from SuperFluffy June 18, 2024 12:42
@github-actions github-actions bot added the labels conductor (pertaining to the astria-conductor crate), sequencer (pertaining to the astria-sequencer crate), sequencer-relayer (pertaining to the astria-sequencer-relayer crate) and composer (pertaining to composer) Jun 18, 2024
@Fraser999 Fraser999 force-pushed the fraser/wrap-metrics branch from eee59d5 to 661cd58 Compare June 18, 2024 12:44
github-merge-queue bot pushed a commit that referenced this pull request Jul 9, 2024
## Summary
This adds further metrics to the sequencer.

## Background
This should help diagnose block production slowdown when the sequencer
is stress-tested.

## Changes
- Added metrics (see below for list).
- Enabled `cnidarium` metrics.

Note that all histograms are still rendered as Prometheus summaries for
now. I have [an open PR](#1192)
which will make it simple to provide buckets for histograms, after which
they will be rendered as true histograms.
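
As a rough illustration of what "true histograms" means here: a Prometheus histogram exposes cumulative bucket counts (one per upper bound, plus a final `+Inf` bucket), whereas a summary exposes quantiles. The sketch below only shows that counting scheme; `cumulative_bucket_counts` is a hypothetical helper, not part of any crate mentioned in this PR.

```rust
/// Counts observations into cumulative buckets.
///
/// Entry `i` is the number of observations less than or equal to
/// `upper_bounds[i]`; the final entry is the `+Inf` bucket, i.e. the total.
pub fn cumulative_bucket_counts(upper_bounds: &[f64], observations: &[f64]) -> Vec<u64> {
    let mut counts: Vec<u64> = upper_bounds
        .iter()
        .map(|bound| {
            observations
                .iter()
                .filter(|value| **value <= *bound)
                .count() as u64
        })
        .collect();
    // The implicit `+Inf` bucket contains every observation.
    counts.push(observations.len() as u64);
    counts
}
```

For bounds `[0.01, 0.1, 1.0]` and observations `[0.005, 0.05, 0.05, 0.5, 2.0]` this yields `[1, 3, 4, 5]`.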

## Testing
Testing will likewise be relatively simple once #1192 is merged. In the
meantime, I ran the smoke test using the code in this PR and manually
checked the new metrics are available and appear sane.

## Metrics
- Added `astria_sequencer_check_tx_duration_seconds` histograms with the
following labels:
    - `length check and parse raw tx`
    - `stateless check`
    - `nonce check`
    - `chain id check`
    - `balance check`
    - `check for removal`
    - `insert to app mempool`
- Added `astria_sequencer_actions_per_transaction_in_mempool` histogram
- Added `astria_sequencer_transaction_in_mempool_size_bytes` histogram
- Added `astria_sequencer_transactions_in_mempool_total` gauge
- Enabled `cnidarium_get_raw_duration_seconds` histogram
- Enabled `cnidarium_nonverifiable_get_raw_duration_seconds` histogram

## Related Issues
Closes #1247.
bharath-123 pushed a commit that referenced this pull request Jul 25, 2024

@SuperFluffy SuperFluffy left a comment

I am very sorry this review took so long.

I have to say I am somewhat ambivalent about us taking such a big detour from "normal" Rust metric handling using the global registry and macros only. It's annoying that we end up threading the &'static Metric through every corner of our stack.

On the other hand I understand and welcome that the individual metric types are handled more strictly.


/// Enables or disables setting the global metrics recorder.
#[must_use]
pub fn with_global_recorder(mut self, use_global_recorder: bool) -> Self {
Contributor

bikeshed: for builders, I have used the convention of e.g. fn global_recorder(mut self) to activate an option, but set_global_recorder(mut self, use_global_recorder: bool) when the setter takes arguments (same for the other setters).

Contributor Author

I think it's a little unconventional to have two ways of doing the same thing in a builder, and a method like global_recorder looks more like a getter than anything else to me. Happy to change the prefix with_ to set_ though :)

Contributor Author

Done in 7432b9c.
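
For readers skimming the thread, the convention settled on above (a single `set_`-prefixed setter taking the value, rather than a pair of methods) might look like the following sketch. Only the names `ConfigBuilder` and `set_global_recorder` come from this PR; the field, the `uses_global_recorder` accessor and the default of `true` are assumptions made for illustration.

```rust
/// Simplified stand-in for the builder discussed in this thread.
#[derive(Debug, Clone, PartialEq)]
pub struct ConfigBuilder {
    use_global_recorder: bool,
}

impl ConfigBuilder {
    /// Assumption for this sketch: the global recorder is enabled by default.
    pub fn new() -> Self {
        Self {
            use_global_recorder: true,
        }
    }

    /// Enables or disables setting the global metrics recorder.
    #[must_use]
    pub fn set_global_recorder(mut self, use_global_recorder: bool) -> Self {
        self.use_global_recorder = use_global_recorder;
        self
    }

    /// Hypothetical accessor, named so that it cannot be mistaken for a setter.
    pub fn uses_global_recorder(&self) -> bool {
        self.use_global_recorder
    }
}
```

Because the setter consumes and returns `self`, `#[must_use]` makes the compiler warn if the returned builder is accidentally dropped.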

Comment on lines +137 to +138
metrics::set_global_recorder(registering_builder.recorder)
.map_err(|_| Error::GlobalMetricsRecorderAlreadySet)?;
Contributor

Does this need a https://doc.rust-lang.org/stable/std/sync/struct.LazyLock.html similar to what we do when setting up tracing in tests?

I don't know how metrics behaves if one tries to register multiple global recorders - tracing panics. Maybe we can provide an extra flag use_test_protection (or something less asinine) that ensures that no more than 1 global recorder is set?

(In services it's IMO desirable to just panic because one should not have more than 1 global recorder).

Contributor Author

It does return an error if we try to register a global recorder more than once. The error is returned from astria_telemetry::Config::try_init in the same way as for a tracing init error, so we're covered that way for production code.

In all our tests (where we use the LazyLock for the telemetry config) we're not setting the metrics up as part of the astria_telemetry::configure() flow, since we want to avoid using the global recorder. So in all tests where your question applies, we're doing something like

let (metrics, metrics_handle) = metrics::ConfigBuilder::new()
    .set_global_recorder(false)
    .build(&())
    .unwrap();

meaning we won't have an error returned since we're not setting the global recorder.
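
The behaviour described in this reply can be modelled with the standard library alone. The sketch below is not the `metrics` crate's implementation: `Recorder`, `Installed`, `install` and `global_name` are hypothetical stand-ins, and only the error name mirrors the `Error::GlobalMetricsRecorderAlreadySet` variant shown in the diff above. Setting the global slot a second time yields an error rather than a panic, while opting out of the global recorder never touches the slot at all.

```rust
use std::sync::OnceLock;

#[derive(Debug, PartialEq)]
pub enum Error {
    GlobalMetricsRecorderAlreadySet,
}

/// Stand-in for a real recorder.
#[derive(Debug)]
pub struct Recorder {
    pub name: &'static str,
}

/// Where the recorder ended up after installation.
#[derive(Debug)]
pub enum Installed {
    Global,
    Local(Recorder),
}

// Process-wide slot that can be filled at most once.
static GLOBAL_RECORDER: OnceLock<Recorder> = OnceLock::new();

/// Installs `recorder` globally if requested, otherwise hands it back to the caller.
pub fn install(recorder: Recorder, use_global_recorder: bool) -> Result<Installed, Error> {
    if use_global_recorder {
        // `OnceLock::set` returns `Err` with the rejected value if already filled.
        GLOBAL_RECORDER
            .set(recorder)
            .map_err(|_| Error::GlobalMetricsRecorderAlreadySet)?;
        Ok(Installed::Global)
    } else {
        Ok(Installed::Local(recorder))
    }
}

/// Name of the globally installed recorder, if any.
pub fn global_name() -> Option<&'static str> {
    GLOBAL_RECORDER.get().map(|recorder| recorder.name)
}
```

Any number of tests can therefore build local recorders in the same process without conflicting, which is the point of passing `false` to `set_global_recorder` in the snippet above.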

@Fraser999 Fraser999 enabled auto-merge August 30, 2024 21:59
@Fraser999 Fraser999 added this pull request to the merge queue Aug 30, 2024
Merged via the queue into main with commit f251316 Aug 30, 2024
@Fraser999 Fraser999 deleted the fraser/wrap-metrics branch August 30, 2024 22:14
steezeburger added a commit that referenced this pull request Sep 3, 2024
* main:
  chore: ibc e2e smoke test (#1284)
  chore(metrics): restrict `metrics` crate usage to `astria-telemetry` (#1192)
  fix(charts)!: sequencer-relayer chart correct startup env var (#1437)
  chore(bridge-withdrawer): Add instrumentation (#1324)
  chore(conductor): Add instrumentation (#1330)
  fix(cli, bridge-withdrawer): dont fail entire block due to bad withdraw event (#1409)
  feat(sequencer, bridge-withdrawer)!: enforce withdrawals consumed (#1391)
jbowen93 pushed a commit that referenced this pull request Sep 3, 2024
…1192)

ethanoroshiba pushed a commit that referenced this pull request Sep 4, 2024
…1192)

jbowen93 pushed a commit that referenced this pull request Sep 6, 2024
…1192)

sgranfield4403-3 added a commit to sgranfield4403-3/astria that referenced this pull request Oct 2, 2025
AngieD101 added a commit to AngieD101/astria that referenced this pull request Oct 10, 2025
Labels

composer (pertaining to composer); conductor (pertaining to the astria-conductor crate); sequencer (pertaining to the astria-sequencer crate); sequencer-relayer (pertaining to the astria-sequencer-relayer crate)

Development

Successfully merging this pull request may close these issues.

Wrap all metrics functionality in newtypes provided in astria-telemetry

2 participants