chore(metrics): restrict metrics crate usage to astria-telemetry#1192
chore(metrics): restrict metrics crate usage to astria-telemetry#1192
metrics crate usage to astria-telemetry#1192Conversation
eee59d5 to
661cd58
Compare
## Summary This adds further metrics to the sequencer. ## Background This should help diagnose block production slowdown when the sequencer is stress-tested. ## Changes - Added metrics (see below for list). - Enabled `cnidarium` metrics. Note that all histograms are still rendered as Prometheus summaries for now. I have [an open PR](#1192) which will make it simple to provide buckets for histograms, after which they will be rendered as true histograms. ## Testing Testing will likewise be relatively simple once #1192 is merged. In the meantime, I ran the smoke test using the code in this PR and manually checked the new metrics are available and appear sane. ## Metrics - Added `astria_sequencer_check_tx_duration_seconds` histograms with the following labels: - `length check and parse raw tx` - `stateless check` - `nonce check` - `chain id check` - `balance check` - `check for removal` - `insert to app mempool` - Added `astria_sequencer_actions_per_transaction_in_mempool` histogram - Added `astria_sequencer_transaction_in_mempool_size_bytes` histogram - Added `astria_sequencer_transactions_in_mempool_total` gauge - Enabled `cnidarium_get_raw_duration_seconds` histogram - Enabled `cnidarium_nonverifiable_get_raw_duration_seconds` histogram ## Related Issues Closes #1247.
## Summary This adds further metrics to the sequencer. ## Background This should help diagnose block production slowdown when the sequencer is stress-tested. ## Changes - Added metrics (see below for list). - Enabled `cnidarium` metrics. Note that all histograms are still rendered as Prometheus summaries for now. I have [an open PR](#1192) which will make it simple to provide buckets for histograms, after which they will be rendered as true histograms. ## Testing Testing will likewise be relatively simple once #1192 is merged. In the meantime, I ran the smoke test using the code in this PR and manually checked the new metrics are available and appear sane. ## Metrics - Added `astria_sequencer_check_tx_duration_seconds` histograms with the following labels: - `length check and parse raw tx` - `stateless check` - `nonce check` - `chain id check` - `balance check` - `check for removal` - `insert to app mempool` - Added `astria_sequencer_actions_per_transaction_in_mempool` histogram - Added `astria_sequencer_transaction_in_mempool_size_bytes` histogram - Added `astria_sequencer_transactions_in_mempool_total` gauge - Enabled `cnidarium_get_raw_duration_seconds` histogram - Enabled `cnidarium_nonverifiable_get_raw_duration_seconds` histogram ## Related Issues Closes #1247.
SuperFluffy
left a comment
There was a problem hiding this comment.
I am very sorry this review took so long.
I have to say I am somewhat ambivalent with us taking such a big detour from "normal" Rust metric handling using the global registry and macros only. It's annoying that we end up threading the &'static Metric through every corner of our stack.
On the other hand I understand and welcome that the invidiual metric types are handled more strictly.
|
|
||
| /// Enables or disables setting the global metrics recorder. | ||
| #[must_use] | ||
| pub fn with_global_recorder(mut self, use_global_recorder: bool) -> Self { |
There was a problem hiding this comment.
bikeshed: for builders, I have used the convention to use e.g. fn global_recorder(mut self) to activate it, but set_global_recorder(mut self, use_global_recorder: bool) to take arguments (same for the other setters).
There was a problem hiding this comment.
I think it's a little unconventional to have two ways of doing the same thing in a builder, and a method like global_recorder looks more like a getter than anything else to me. Happy to change the prefix with_ to set_ though :)
| metrics::set_global_recorder(registering_builder.recorder) | ||
| .map_err(|_| Error::GlobalMetricsRecorderAlreadySet)?; |
There was a problem hiding this comment.
Does this need a https://doc.rust-lang.org/stable/std/sync/struct.LazyLock.html similar to what we do when setting up tracing in tests?
I don't know how metrics behaves if one tries to register multiple global recorders - tracing panics. Maybe we can provide an extra flag use_test_protection (or something less asinine) that ensures that no more than 1 global recorder is set?
(In services it's IMO desirable to just panic because one should not have more than 1 global recorder).
There was a problem hiding this comment.
It does return an error if we try to register a global recorder more than once. The error is being returned from from astria_telemetry::Config::try_init in the same way as for a tracing init error, so we're covered that way for production code.
In all our tests (where we use the LazyLock for the telemetry config) we're not setting the metrics up as part of the astria_telemetry::configure() flow, since we want to avoid using the global recorder. So in all tests where your question applies, we're doing something like
let (metrics, metrics_handle) = metrics::ConfigBuilder::new()
.set_global_recorder(false)
.build(&())
.unwrap();meaning we won't have an error returned since we're not setting the global recorder.
* main: chore: ibc e2e smoke test (#1284) chore(metrics): restrict `metrics` crate usage to `astria-telemetry` (#1192) fix(charts)!: sequencer-relayer chart correct startup env var (#1437) chore(bridge-withdrawer): Add instrumentation (#1324) chore(conductor): Add instrumentation (#1330) fix(cli, bridge-withdrawer): dont fail entire block due to bad withdraw event (#1409) feat(sequencer, bridge-withdrawer)!: enforce withdrawals consumed (#1391)
…1192) ## Summary This PR introduces several types in `astria-telemetry` to support building and recording metrics in our other crates. ## Background There are a few reasons driving the design here: * I wanted to remove the third-party `metrics` crate as a direct dependency from all our crates except `astria-telemetry` so that its macros aren't available in our other crates. This precludes accidental addition of a new metric in the future which isn't properly initialized at app startup. * There's no possibility of a description for a given metric being ignored due to an accidental mismatch in the `metrics` macros being used (e.g. we previously had a case of `describe_gauge!` being used with a `counter!` meaning the description was ignored) * We can set buckets on a per-histogram basis * We can set up metrics in our black box tests without requiring an actual http server to be running. Configuring the `metrics` crate currently doesn't support either passing in an already-bound listener (where we could bind to port 0 and find out what actual port was used before giving the listener to `metrics`) or querying the running server in `metrics` to find out the port. This makes reliable testing using an actual http server essentially impossible. With the changes in this PR, the black box tests will retain a handle to the underlying exporter, meaning snapshots can be rendered which yield the same view of the metrics as if the http endpoint had been called. ## Changes * Added newtypes for `Counter`, `Gauge` and `Histogram`. * Added a `Metrics` trait, implemented by each of the crates' `Metrics` structs to support registering the metrics via `telemetry::Config::try_init`. * Added `metrics::ConfigBuilder` to support configuring and initializing metrics outside of `telemetry::Config::try_init` (for use in black box tests). * Added `metrics::RegisteringBuilder` to support registering individual metrics via the `Metrics` trait. To register a metric, this builder returns an appropriate factory, and the factory is then used potentially multiple times to register and return the individual metric with or without groups of labels applied. * Added `metrics::BucketBuilder` to support registering histogram buckets via the `Metrics` trait. This is unfortunately a separate builder since the `metrics` crate doesn't support registering buckets at the same point as registering the histograms. The former must happen via the `PrometheusBuilder`, but the latter can only be done once the `PrometheusBuilder` has been consumed and converted to a `PrometheusRecorder`. I added checks to ensure that buckets set via the `BucketBuilder` are all related to actual histograms registered later. * Changed the `telemetry::Config` setters to use a consistent `set_` naming convention and merged the two metrics-related setters into a single one, since now we require either both or neither. Note that these changes aren't mutually exclusive to using the raw `metrics` crate macros. So while we want to discourage usage in our own crates, dependencies using the macros will continue to function as before assuming we have configured metrics to use the global recorder. Using `telemetry::Config::try_init` does enable the global recorder, whereas we disable it in our black box tests so they don't conflict. ## Testing Manually checked metrics are still available right after startup, and manually checked the handle provided to black box tests functions as expected. However, proper testing of metrics should now be possible and I'd like to tackle that in follow-up PRs. ## Related Issues Closes #1147.
…1192) ## Summary This PR introduces several types in `astria-telemetry` to support building and recording metrics in our other crates. ## Background There are a few reasons driving the design here: * I wanted to remove the third-party `metrics` crate as a direct dependency from all our crates except `astria-telemetry` so that its macros aren't available in our other crates. This precludes accidental addition of a new metric in the future which isn't properly initialized at app startup. * There's no possibility of a description for a given metric being ignored due to an accidental mismatch in the `metrics` macros being used (e.g. we previously had a case of `describe_gauge!` being used with a `counter!` meaning the description was ignored) * We can set buckets on a per-histogram basis * We can set up metrics in our black box tests without requiring an actual http server to be running. Configuring the `metrics` crate currently doesn't support either passing in an already-bound listener (where we could bind to port 0 and find out what actual port was used before giving the listener to `metrics`) or querying the running server in `metrics` to find out the port. This makes reliable testing using an actual http server essentially impossible. With the changes in this PR, the black box tests will retain a handle to the underlying exporter, meaning snapshots can be rendered which yield the same view of the metrics as if the http endpoint had been called. ## Changes * Added newtypes for `Counter`, `Gauge` and `Histogram`. * Added a `Metrics` trait, implemented by each of the crates' `Metrics` structs to support registering the metrics via `telemetry::Config::try_init`. * Added `metrics::ConfigBuilder` to support configuring and initializing metrics outside of `telemetry::Config::try_init` (for use in black box tests). * Added `metrics::RegisteringBuilder` to support registering individual metrics via the `Metrics` trait. To register a metric, this builder returns an appropriate factory, and the factory is then used potentially multiple times to register and return the individual metric with or without groups of labels applied. * Added `metrics::BucketBuilder` to support registering histogram buckets via the `Metrics` trait. This is unfortunately a separate builder since the `metrics` crate doesn't support registering buckets at the same point as registering the histograms. The former must happen via the `PrometheusBuilder`, but the latter can only be done once the `PrometheusBuilder` has been consumed and converted to a `PrometheusRecorder`. I added checks to ensure that buckets set via the `BucketBuilder` are all related to actual histograms registered later. * Changed the `telemetry::Config` setters to use a consistent `set_` naming convention and merged the two metrics-related setters into a single one, since now we require either both or neither. Note that these changes aren't mutually exclusive to using the raw `metrics` crate macros. So while we want to discourage usage in our own crates, dependencies using the macros will continue to function as before assuming we have configured metrics to use the global recorder. Using `telemetry::Config::try_init` does enable the global recorder, whereas we disable it in our black box tests so they don't conflict. ## Testing Manually checked metrics are still available right after startup, and manually checked the handle provided to black box tests functions as expected. However, proper testing of metrics should now be possible and I'd like to tackle that in follow-up PRs. ## Related Issues Closes #1147.
…1192) ## Summary This PR introduces several types in `astria-telemetry` to support building and recording metrics in our other crates. ## Background There are a few reasons driving the design here: * I wanted to remove the third-party `metrics` crate as a direct dependency from all our crates except `astria-telemetry` so that its macros aren't available in our other crates. This precludes accidental addition of a new metric in the future which isn't properly initialized at app startup. * There's no possibility of a description for a given metric being ignored due to an accidental mismatch in the `metrics` macros being used (e.g. we previously had a case of `describe_gauge!` being used with a `counter!` meaning the description was ignored) * We can set buckets on a per-histogram basis * We can set up metrics in our black box tests without requiring an actual http server to be running. Configuring the `metrics` crate currently doesn't support either passing in an already-bound listener (where we could bind to port 0 and find out what actual port was used before giving the listener to `metrics`) or querying the running server in `metrics` to find out the port. This makes reliable testing using an actual http server essentially impossible. With the changes in this PR, the black box tests will retain a handle to the underlying exporter, meaning snapshots can be rendered which yield the same view of the metrics as if the http endpoint had been called. ## Changes * Added newtypes for `Counter`, `Gauge` and `Histogram`. * Added a `Metrics` trait, implemented by each of the crates' `Metrics` structs to support registering the metrics via `telemetry::Config::try_init`. * Added `metrics::ConfigBuilder` to support configuring and initializing metrics outside of `telemetry::Config::try_init` (for use in black box tests). * Added `metrics::RegisteringBuilder` to support registering individual metrics via the `Metrics` trait. To register a metric, this builder returns an appropriate factory, and the factory is then used potentially multiple times to register and return the individual metric with or without groups of labels applied. * Added `metrics::BucketBuilder` to support registering histogram buckets via the `Metrics` trait. This is unfortunately a separate builder since the `metrics` crate doesn't support registering buckets at the same point as registering the histograms. The former must happen via the `PrometheusBuilder`, but the latter can only be done once the `PrometheusBuilder` has been consumed and converted to a `PrometheusRecorder`. I added checks to ensure that buckets set via the `BucketBuilder` are all related to actual histograms registered later. * Changed the `telemetry::Config` setters to use a consistent `set_` naming convention and merged the two metrics-related setters into a single one, since now we require either both or neither. Note that these changes aren't mutually exclusive to using the raw `metrics` crate macros. So while we want to discourage usage in our own crates, dependencies using the macros will continue to function as before assuming we have configured metrics to use the global recorder. Using `telemetry::Config::try_init` does enable the global recorder, whereas we disable it in our black box tests so they don't conflict. ## Testing Manually checked metrics are still available right after startup, and manually checked the handle provided to black box tests functions as expected. However, proper testing of metrics should now be possible and I'd like to tackle that in follow-up PRs. ## Related Issues Closes #1147.
## Summary This adds further metrics to the sequencer. ## Background This should help diagnose block production slowdown when the sequencer is stress-tested. ## Changes - Added metrics (see below for list). - Enabled `cnidarium` metrics. Note that all histograms are still rendered as Prometheus summaries for now. I have [an open PR](astriaorg/astria#1192) which will make it simple to provide buckets for histograms, after which they will be rendered as true histograms. ## Testing Testing will likewise be relatively simple once #1192 is merged. In the meantime, I ran the smoke test using the code in this PR and manually checked the new metrics are available and appear sane. ## Metrics - Added `astria_sequencer_check_tx_duration_seconds` histograms with the following labels: - `length check and parse raw tx` - `stateless check` - `nonce check` - `chain id check` - `balance check` - `check for removal` - `insert to app mempool` - Added `astria_sequencer_actions_per_transaction_in_mempool` histogram - Added `astria_sequencer_transaction_in_mempool_size_bytes` histogram - Added `astria_sequencer_transactions_in_mempool_total` gauge - Enabled `cnidarium_get_raw_duration_seconds` histogram - Enabled `cnidarium_nonverifiable_get_raw_duration_seconds` histogram ## Related Issues Closes #1247.
## Summary This adds further metrics to the sequencer. ## Background This should help diagnose block production slowdown when the sequencer is stress-tested. ## Changes - Added metrics (see below for list). - Enabled `cnidarium` metrics. Note that all histograms are still rendered as Prometheus summaries for now. I have [an open PR](astriaorg/astria#1192) which will make it simple to provide buckets for histograms, after which they will be rendered as true histograms. ## Testing Testing will likewise be relatively simple once #1192 is merged. In the meantime, I ran the smoke test using the code in this PR and manually checked the new metrics are available and appear sane. ## Metrics - Added `astria_sequencer_check_tx_duration_seconds` histograms with the following labels: - `length check and parse raw tx` - `stateless check` - `nonce check` - `chain id check` - `balance check` - `check for removal` - `insert to app mempool` - Added `astria_sequencer_actions_per_transaction_in_mempool` histogram - Added `astria_sequencer_transaction_in_mempool_size_bytes` histogram - Added `astria_sequencer_transactions_in_mempool_total` gauge - Enabled `cnidarium_get_raw_duration_seconds` histogram - Enabled `cnidarium_nonverifiable_get_raw_duration_seconds` histogram ## Related Issues Closes #1247.
Summary
This PR introduces several types in
astria-telemetryto support building and recording metrics in our other crates.Background
There are a few reasons driving the design here:
metricscrate as a direct dependency from all our crates exceptastria-telemetryso that its macros aren't available in our other crates. This precludes accidental addition of a new metric in the future which isn't properly initialized at app startup.metricsmacros being used (e.g. we previously had a case ofdescribe_gauge!being used with acounter!meaning the description was ignored)metricscrate currently doesn't support either passing in an already-bound listener (where we could bind to port 0 and find out what actual port was used before giving the listener tometrics) or querying the running server inmetricsto find out the port. This makes reliable testing using an actual http server essentially impossible. With the changes in this PR, the black box tests will retain a handle to the underlying exporter, meaning snapshots can be rendered which yield the same view of the metrics as if the http endpoint had been called.Changes
Counter,GaugeandHistogram.Metricstrait, implemented by each of the crates'Metricsstructs to support registering the metrics viatelemetry::Config::try_init.metrics::ConfigBuilderto support configuring and initializing metrics outside oftelemetry::Config::try_init(for use in black box tests).metrics::RegisteringBuilderto support registering individual metrics via theMetricstrait. To register a metric, this builder returns an appropriate factory, and the factory is then used potentially multiple times to register and return the individual metric with or without groups of labels applied.metrics::BucketBuilderto support registering histogram buckets via theMetricstrait. This is unfortunately a separate builder since themetricscrate doesn't support registering buckets at the same point as registering the histograms. The former must happen via thePrometheusBuilder, but the latter can only be done once thePrometheusBuilderhas been consumed and converted to aPrometheusRecorder. I added checks to ensure that buckets set via theBucketBuilderare all related to actual histograms registered later.telemetry::Configsetters to use a consistentset_naming convention and merged the two metrics-related setters into a single one, since now we require either both or neither.Note that these changes aren't mutually exclusive to using the raw
metricscrate macros. So while we want to discourage usage in our own crates, dependencies using the macros will continue to function as before assuming we have configured metrics to use the global recorder. Usingtelemetry::Config::try_initdoes enable the global recorder, whereas we disable it in our black box tests so they don't conflict.Testing
Manually checked metrics are still available right after startup, and manually checked the handle provided to black box tests functions as expected. However, proper testing of metrics should now be possible and I'd like to tackle that in follow-up PRs.
Related Issues
Closes #1147.