Implement operational metrics #105

Pokom · 2024-02-08T19:52:27Z

Unlike our cost metrics, operational metrics should be uniform across all providers and collectors. This makes dashboards and alerts a lot easier to set up, e.g. you only have one set of alerts instead of N*M sets (where N is the number of providers and M is the number of collectors). This work has already begun with #104, which can be refactored to belong to the providers package and be used by both AWS and GCP. While the current set is a nice foundation, we can extend this even further to:

cloudcost_exporter_collector_api_requests_total
cloudcost_exporter_collector_api_requests_errors_total
cloudcost_exporter_collector_api_requests_duration_seconds
cloudcost_exporter_collector_last_scrape_time

We'd likely need the following labels:

provider => CSP
collector => Module that's making the request
service => The backend system being called (compute, storage, billing, costexplorer, etc.)
method => The API method being called (ListInstancesInZone, GetServiceName, GetCostUsage)

Once we do this, we can update our existing operational dashboard (https://admin-ops-us-east-0.grafana-ops.net/grafana/d/1a9c0de366458599246184cf0ae8b468/cloudcost-exporter-overview?orgId=1) to use the generic metrics instead of the provider-specific ones.


Pokom · 2024-02-08T20:09:34Z

Just to say, naming is hard and I'm not married to the idea of it being ...collector_api_requests..., I just don't have a better way of communicating external requests.

In order to better track the freshness of data, this PR adds a few more operational metrics:

- `cloudcost_exporter_collector_last_scrape_time`
- `cloudcost_exporter_last_scrape_time`

The intent of these is to export, in Unix time, the last time a scrape was performed. This can be used to alert in Prometheus when the last scrape was, say, more than 60 minutes ago. This also implements in AWS the operational metrics that GCP implemented so that we have feature parity between the two. In the future it would make sense to generalize this to a common interface so that new providers do not need to implement the same metrics.

- refs #5 + #105

Pokom · 2024-07-09T14:23:21Z

This is likely related work to #222 and could follow similar implementations.

Pokom changed the title ~~Define operational metrics~~ Implement operational metrics Feb 8, 2024

Pokom mentioned this issue Apr 30, 2024

feat(gcp+aws): Add last_scrape_time metric #159

Merged

Pokom added the good first issue and area/capacity labels and removed the good first issue label Jul 9, 2024
