-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Refactor] Complete metrics overhaul #1192
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, windows-2022 / stable, and 73 discussions need to be resolved
Cargo.toml
line 74 at r1 (raw file):
[dev-dependencies] nativelink-metric-tests = { path = "nativelink-metric-tests" }
todo: check all of these to ensure we still need them.
nativelink-metric/nativelink-metric-macro-derive/src/lib.rs
line 26 at r1 (raw file):
#[derive(Default, Debug)] enum GroupType {
todo: Document this struct.
nativelink-metric/nativelink-metric-macro-derive/src/lib.rs
line 56 at r1 (raw file):
#[derive(Debug)] enum MetricKind {
todo: Document this struct.
nativelink-metric/nativelink-metric-macro-derive/src/lib.rs
line 221 at r1 (raw file):
}, }; // panic!("{}", quote! { #metrics_struct });
todo: Put a comment here explaining why this was left in.
nativelink-metric/src/lib.rs
line 35 at r1 (raw file):
#[derive(Debug)] pub struct Error(String);
todo: Comment why this does not use nativelink-error.
nativelink-metric/src/lib.rs
line 46 at r1 (raw file):
#[derive(Default, Clone)] pub struct MetricFieldData<'a> {
todo: Comment this struct.
nativelink-metric/src/lib.rs
line 53 at r1 (raw file):
#[derive(Debug)] pub enum MetricPublishKnownKindData {
todo: Comment this struct.
nativelink-metric/src/lib.rs
line 61 at r1 (raw file):
#[derive(Clone, Copy, Debug)] #[repr(u8)] pub enum MetricKind {
todo: Comment this struct.
nativelink-metric/src/lib.rs
line 480 at r1 (raw file):
$crate::__metric_span!(target: "nativelink_metric", "", __name = $name.to_string()) }; // ($name:expr, $($fields:tt)*) => {
todo: Remove.
nativelink-metric-collector/Cargo.toml
line 11 at r1 (raw file):
opentelemetry_sdk = { version = "0.23.0", features = ["metrics", "rt-tokio"] } tracing-subscriber = "0.3.18" # tracing-opentelemetry = { version = "0.25.0", features = ["metrics"] }
todo: Cleanup.
nativelink-metric-collector/src/lib.rs
line 1 at r1 (raw file):
pub use tracing_layers::MetricsCollectorLayer;
todo: Copyright.
nativelink-metric-collector/src/metrics_collection.rs
line 1 at r1 (raw file):
use std::{
todo: Copyright.
nativelink-metric-collector/src/metrics_collection.rs
line 13 at r1 (raw file):
#[derive(Debug, Serialize)] #[serde(untagged)] pub enum CollectedMetricPrimitiveValue {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_collection.rs
line 19 at r1 (raw file):
#[derive(Default, Debug)] pub struct CollectedMetricPrimitive {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_collection.rs
line 38 at r1 (raw file):
} pub type CollectedMetricChildren = HashMap<String, CollectedMetrics>;
todo: Comment this struct.
nativelink-metric-collector/src/metrics_collection.rs
line 42 at r1 (raw file):
#[derive(Debug, Serialize)] #[serde(untagged)] pub enum CollectedMetrics {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_collection.rs
line 54 at r1 (raw file):
#[derive(Default, Debug, Serialize)] pub struct RootMetricCollectedMetrics {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_collection.rs
line 60 at r1 (raw file):
impl RootMetricCollectedMetrics { pub fn to_json5(&self) -> Result<std::string::String, serde_json::Error> {
todo: Remove.
nativelink-metric-collector/src/metrics_visitors.rs
line 1 at r1 (raw file):
use std::{borrow::Cow, fmt::Debug};
todo: copyright.
nativelink-metric-collector/src/metrics_visitors.rs
line 10 at r1 (raw file):
#[derive(Default, Debug, Serialize)] pub enum CollectionKind {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_visitors.rs
line 27 at r1 (raw file):
#[derive(Debug)] enum ValueWithPrimitiveType {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_visitors.rs
line 39 at r1 (raw file):
#[derive(Default, Debug)] pub struct MetricDataVisitor {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_visitors.rs
line 67 at r1 (raw file):
impl Visit for MetricDataVisitor { // Required method
todo: Remove comments.
nativelink-metric-collector/src/metrics_visitors.rs
line 125 at r1 (raw file):
} pub struct SpanFields {
todo: Comment this struct.
nativelink-metric-collector/src/metrics_visitors.rs
line 130 at r1 (raw file):
impl Visit for SpanFields { // Required method
todo: Remove comment.
nativelink-metric-collector/src/otel_exporter.rs
line 1 at r1 (raw file):
use opentelemetry::metrics::Meter;
todo: Copyright.
nativelink-metric-collector/src/otel_exporter.rs
line 9 at r1 (raw file):
}; const MAX_METRIC_NAME_LENGTH: usize = 256;
todo: Comment this struct.
nativelink-metric-collector/src/otel_exporter.rs
line 11 at r1 (raw file):
const MAX_METRIC_NAME_LENGTH: usize = 256; pub fn otel_export(
todo: Comment this struct.
nativelink-metric-collector/src/tracing_layers.rs
line 1 at r1 (raw file):
use parking_lot::Mutex;
todo copyright.
nativelink-metric-collector/src/tracing_layers.rs
line 20 at r1 (raw file):
use crate::metrics_visitors::{MetricDataVisitor, SpanFields}; pub struct MetricsCollectorLayer<S> {
todo: Comment this struct.
nativelink-metric-collector/src/tracing_layers.rs
line 27 at r1 (raw file):
impl<S> MetricsCollectorLayer<S> { pub fn new() -> (Self, Arc<Mutex<RootMetricCollectedMetrics>>) {
todo: Comment this function.
nativelink-metric-collector/tests/metric_collector_test.rs
line 69 at r1 (raw file):
// } // TODO(FINISH TEST)
todo
nativelink-metric-tests/Cargo.toml
line 7 at r1 (raw file):
[dependencies] nativelink-metric = { path = "../nativelink-metric" }
todo: new line.
nativelink-metric-tests/src/lib.rs
line 45 at r1 (raw file):
// #[derive(MetricsComponent)] // struct FooBar(#[metric] u64);
todo: Remove this entire folder.
nativelink-scheduler/src/api_worker_scheduler.rs
line 53 at r1 (raw file):
} impl MetricsComponent for Workers {
todo: comment why this is not a derive-macro.
nativelink-scheduler/src/api_worker_scheduler.rs
line 487 at r1 (raw file):
impl RootMetricsComponent for ApiWorkerScheduler {} // impl MetricsComponent for ApiWorkerScheduler {
todo remove.
nativelink-scheduler/src/default_scheduler_factory.rs
line 78 at r1 (raw file):
}; // if let Some(scheduler_metrics) = maybe_scheduler_metrics {
todo remove.
nativelink-scheduler/src/platform_property_manager.rs
line 37 at r1 (raw file):
let _enter = group!("known_properties").entered(); for (k, v) in &self.known_properties { group!(k).in_scope(|| {
todo: This should use publish macro.
nativelink-scheduler/src/worker.rs
line 257 at r1 (raw file):
} // impl MetricsComponent for Worker {
todo! remove.
nativelink-store/src/fast_slow_store.rs
line 404 at r1 (raw file):
} // impl MetricsComponent for FastSlowStoreMetrics {
todo remove.
nativelink-store/src/filesystem_store.rs
line 926 at r1 (raw file):
} // impl<Fe: FileEntry> MetricsComponent for FilesystemStore<Fe> {
todo: remove.
nativelink-store/src/memory_store.rs
line 324 at r1 (raw file):
} // impl MetricsComponent for MemoryStore {
todo: remove.
nativelink-store/src/redis_store.rs
line 612 at r1 (raw file):
} // impl<C> MetricsComponent for RedisStore<C>
todo: Remove.
nativelink-store/src/size_partitioning_store.rs
line 168 at r1 (raw file):
} // impl MetricsComponent for SizePartitioningStore {
todo: remove.
nativelink-util/src/action_messages.rs
line 339 at r1 (raw file):
pub struct ActionUniqueKey { /// Name of instance group this action belongs to. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 342 at r1 (raw file):
pub instance_name: String, /// The digest function this action expects. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 345 at r1 (raw file):
pub digest_function: DigestHasherFunc, /// Digest of the underlying `Action`. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 357 at r1 (raw file):
pub struct ActionInfo { /// Digest of the underlying `Command`. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 360 at r1 (raw file):
pub command_digest: DigestInfo, /// Digest of the underlying `Directory`. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 363 at r1 (raw file):
pub input_root_digest: DigestInfo, /// Timeout of the action. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 369 at r1 (raw file):
pub platform_properties: PlatformProperties, /// The priority of the action. Higher value means it should execute faster. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 372 at r1 (raw file):
pub priority: i32, /// When this action started to be loaded from the CAS. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 375 at r1 (raw file):
pub load_timestamp: SystemTime, /// When this action was created. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 379 at r1 (raw file):
/// Info used to uniquely identify this ActionInfo and if it is cachable. /// This is primarily used to join actions/operations together using this key. #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 1144 at r1 (raw file):
#[derive(PartialEq, Debug, Clone, Serialize, Deserialize, MetricsComponent)] pub struct ActionState { #[metric]
todo: help
nativelink-util/src/action_messages.rs
line 1146 at r1 (raw file):
#[metric] pub stage: ActionStage, #[metric]
todo: help
nativelink-util/src/evicting_map.rs
line 496 at r1 (raw file):
} // impl<K: Ord + Hash + Eq + Clone + Debug, T: LenEntry + Debug, I: InstantWrapper> MetricsComponent
todo: Remove.
nativelink-util/src/metrics_utils.rs
line 80 at r1 (raw file):
field_metadata: MetricFieldData, ) -> Result<MetricPublishKnownKindData, nativelink_metric::Error> { let _enter = group!(field_metadata.name).entered();
todo: Comment why we do this instead of derive-macro.
nativelink-util/src/metrics_utils.rs
line 168 at r1 (raw file):
field_metadata: MetricFieldData, ) -> Result<MetricPublishKnownKindData, nativelink_metric::Error> { let _enter = group!(field_metadata.name).entered();
todo: Comment why we do this instead of derive-macro.
nativelink-util/src/metrics_utils.rs
line 341 at r1 (raw file):
field_metadata: MetricFieldData, ) -> Result<MetricPublishKnownKindData, nativelink_metric::Error> { let _enter = group!(field_metadata.name).entered();
todo: Comment why we do this instead of derive macro.
nativelink-util/src/platform_properties.rs
line 132 at r1 (raw file):
) -> Result<MetricPublishKnownKindData, nativelink_metric::Error> { let _enter = match self { Self::Exact(v) => group!("exact").in_scope(|| v.publish(kind, field_metadata)),
todo: Make this publish!()
;
nativelink-worker/src/local_worker.rs
line 643 at r1 (raw file):
} // impl MetricsComponent for Metrics {
todo Remove.
nativelink-worker/src/running_actions_manager.rs
line 1952 at r1 (raw file):
} // impl MetricsComponent for Metrics {
todo: Remove.
src/bin/nativelink.rs
line 64 at r1 (raw file):
use prometheus::{Encoder, TextEncoder}; use tracing_subscriber::layer::SubscriberExt; // use opentelemetry_sdk::metrics::data::Temporality;
todo: Remove.
src/bin/nativelink.rs
line 110 at r1 (raw file):
#[derive(MetricsComponent)] struct RootMetrics {
todo: Comment what this struct does.
src/bin/nativelink.rs
line 128 at r1 (raw file):
#[derive(Hash, PartialEq, Eq)] struct SocketAddrWrapper(SocketAddr);
todo: Comment why we wrap.
src/bin/nativelink.rs
line 172 at r1 (raw file):
let mut health_register_store = health_registry_lock.sub_builder(health_component_name.into()); // let store_metrics = root_store_metrics.sub_registry_with_prefix(&name);
todo: Remove.
src/bin/nativelink.rs
line 183 at r1 (raw file):
let mut worker_schedulers = HashMap::new(); if let Some(schedulers_cfg) = cfg.schedulers { // let root_scheduler_metrics = root_metrics_registry.sub_registry_with_prefix("schedulers");
todo: Remove.
src/bin/nativelink.rs
line 185 at r1 (raw file):
// let root_scheduler_metrics = root_metrics_registry.sub_registry_with_prefix("schedulers"); for (name, scheduler_cfg) in schedulers_cfg { // let scheduler_metrics = root_scheduler_metrics.sub_registry_with_prefix(&name);
todo: Remove.
src/bin/nativelink.rs
line 241 at r1 (raw file):
// Lock our registry as immutable and clonable. // let root_metrics_registry = Arc::new(AsyncMutex::new(root_metrics_registry)); // let store_metrics_handlers = Arc::new(store_metrics_handlers);
todo: Remove.
src/bin/nativelink.rs
line 445 at r1 (raw file):
// let root_metrics_registry = root_metrics_registry.clone(); // let store_metrics_handlers = store_metrics_handlers.clone();
todo: Remove.
src/bin/nativelink.rs
line 873 at r1 (raw file):
let worker_cfgs = cfg.workers.unwrap_or_default(); // let mut root_metrics_registry_guard = root_metrics_registry.lock().await; // let root_worker_metrics = root_metrics_registry_guard.sub_registry_with_prefix("workers");
todo: Remove.
src/bin/nativelink.rs
line 932 at r1 (raw file):
} // let worker_metrics = root_worker_metrics.sub_registry_with_prefix(&name); // local_worker.register_metrics(worker_metrics);
todo: Remove.
94c7c84
to
fe4c78c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+@adam-singer
+@aaronmondal
+@caass
Ready for review, but some tests need to be finished and I need to get the linters happy.
Reviewable status: 0 of 3 LGTMs obtained, and pending CI: Publish image, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, vale, and 5 discussions need to be resolved (waiting on @aaronmondal, @adam-singer, and @caass)
a discussion (no related file):
I looked into using something like:
https://docs.rs/tracing-opentelemetry/latest/tracing_opentelemetry/
But this doesn't really play well with logs or any other data exporting. Under the hood this is doing very close to what the above library does, but in a way that allows the export of general data.
The library linked above has no dynamic configuration ability. Meaning, spans are not honored, so if we did metrics the way they want to do it, every MemoryStore
would be publishing the exact same metrics. The way it's done in this implementation is by taking advantage of tracing's spans and reading the layers during the events.
a discussion (no related file):
I also looked into plain otel, but they are really limited and you end up in the same problem of not being able to publish metrics with layer knowledge easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 75 files at r1, 5 of 40 files at r2.
Reviewable status: 0 of 3 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-22.04, and 5 discussions need to be resolved (waiting on @aaronmondal, @adam-singer, and @caass)
Cargo.toml
line 74 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
todo: check all of these to ensure we still need them.
These can be removed
tracing-opentelemetry = { version = "0.25.0", features = ["metrics"] }
opentelemetry-stdout = "0.5.0"
opentelemetry_api = { version = "0.20.0", features = ["metrics"] }
a12dca8
to
a34447d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 41 of 75 files at r1, 18 of 40 files at r2, 9 of 9 files at r3, 11 of 11 files at r4, all commit messages.
Reviewable status: 1 of 3 LGTMs obtained, and pending CI: Remote / large-ubuntu-22.04, integration-tests (20.04), integration-tests (22.04), and 5 discussions need to be resolved (waiting on @aaronmondal and @caass)
a discussion (no related file):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 3 LGTMs obtained, and pending CI: Analyze (javascript-typescript), pre-commit-checks, and 4 discussions need to be resolved (waiting on @aaronmondal and @caass)
nativelink-metric-collector/tests/metric_collector_test.rs
line 69 at r1 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
todo
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 3 LGTMs obtained, and pending CI: Analyze (javascript-typescript), pre-commit-checks, and 3 discussions need to be resolved (waiting on @aaronmondal and @caass)
Cargo.toml
line 74 at r1 (raw file):
Previously, zbirenbaum (Zach Birenbaum) wrote…
These can be removed
tracing-opentelemetry = { version = "0.25.0", features = ["metrics"] } opentelemetry-stdout = "0.5.0" opentelemetry_api = { version = "0.20.0", features = ["metrics"] }
Done.
3407987
to
d40d7e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 40 files at r2, 5 of 7 files at r5, 11 of 11 files at r6, all commit messages.
Reviewable status: 1 of 3 LGTMs obtained, and all files reviewed, and pending CI: Remote / large-ubuntu-22.04 (waiting on @aaronmondal and @caass)
Metrics got an entire overhaul. Instead of relying on a broken prometheus library to publish our metrics, we now use the `tracing` library and with OpenTelemetry that we bind together then publish into a prometheus library. Metrics are now mostly derive-macros. This means that the struct can express what it wants to export and a help text. The library will choose if it is able to export it. Tracing now works by calling `.publish()` on the parent structs, those structs need to call `.publish()` on all the child members it wishes to publish data about. If a "group" is requested, use the `group!()` macro, which under-the-hood calls `tracing::span` with some special labels. At primitive layers, it will call the `publish!()` macro, which will call `tracing::event!()` macro under-the-hood with some special fields set. A custom `tracing::Subscriber` will intercept all the events and spans and convert them into a json-like object. This object can then be exported as real json or encoded into other formats like otel/prometheus. closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209 towards: TraceMachina#206
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 34 of 75 files at r1, 17 of 40 files at r2, 3 of 9 files at r3, 7 of 11 files at r4, 4 of 7 files at r5, 8 of 11 files at r6, 11 of 11 files at r7, all commit messages.
Reviewable status: 1 of 3 LGTMs obtained, and all files reviewed (waiting on @aaronmondal and @caass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 1 LGTMs obtained, and all files reviewed
Metrics got an entire overhaul. Instead of relying on a broken prometheus library to publish our metrics, we now use the `tracing` library and with OpenTelemetry that we bind together then publish into a prometheus library. Metrics are now mostly derive-macros. This means that the struct can express what it wants to export and a help text. The library will choose if it is able to export it. Tracing now works by calling `.publish()` on the parent structs, those structs need to call `.publish()` on all the child members it wishes to publish data about. If a "group" is requested, use the `group!()` macro, which under-the-hood calls `tracing::span` with some special labels. At primitive layers, it will call the `publish!()` macro, which will call `tracing::event!()` macro under-the-hood with some special fields set. A custom `tracing::Subscriber` will intercept all the events and spans and convert them into a json-like object. This object can then be exported as real json or encoded into other formats like otel/prometheus. closes: TraceMachina#1164, TraceMachina#650, TraceMachina#384, TraceMachina#209 towards: TraceMachina#206
Metrics got an entire overhaul. Instead of relying on a broken prometheus library to publish our metrics, we now use the
tracing
library and with OpenTelemetry that we bind together then publish into a prometheus library.Metrics are now mostly derive-macros. This means that the struct can express what it wants to export and a help text. The library will choose if it is able to export it.
Tracing now works by calling
.publish()
on the parent structs, those structs need to call.publish()
on all the child members it wishes to publish data about. If a "group" is requested, use thegroup!()
macro, which under-the-hood callstracing::span
with some special labels. At primitive layers, it will call thepublish!()
macro, which will calltracing::event!()
macro under-the-hood with some special fields set. A customtracing::Subscriber
will intercept all the events and spans and convert them into a json-like object. This object can then be exported as real json or encoded into other formats like otel/prometheus.closes: #1164, #650, #384, #209
towards: #206
Description
Please include a summary of the changes and the related issue. Please also
include relevant motivation and context.
Fixes # (issue)
Type of change
Please delete options that aren't relevant.
not work as expected)
How Has This Been Tested?
Please also list any relevant details for your test configuration
Checklist
bazel test //...
passes locallygit amend
see some docsThis change is