
Metricspedia

Carlo Cabanilla edited this page Jan 2, 2014 · 5 revisions

System check

System checks are implemented separately for each operating system.

Memory

We report system.mem.free exactly as the OS gives it to us, and we precompute system.mem.used as system.mem.total - system.mem.free. Because used is everything that is not free, it includes cached memory. To exclude cached memory, subtract system.mem.cached from system.mem.used:

avg:system.mem.used{host:myhost} - avg:system.mem.cached{host:myhost}

We also provide a convenience metric called system.mem.usable, which is the sum of free, buffered and cached memory, on the assumption that the OS will give up buffered and cached memory to other apps that need it. That metric is also available as a percentage, system.mem.pct_usable, which is useful for alerting.
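As a rough illustration of the arithmetic described above, here is a minimal sketch in Python. It is not the agent's actual code: the function name derive_memory_metrics and its parameters are hypothetical, and it assumes pct_usable is expressed as a fraction of total memory (multiply by 100 if you want a 0-100 value).

```python
def derive_memory_metrics(total, free, buffered, cached):
    """Derive used/usable values from raw OS numbers.

    Hypothetical helper for illustration only. All inputs must be in the
    same unit (for example MB).
    """
    if total <= 0:
        raise ValueError("total must be positive")

    used = total - free                # includes buffered and cached memory
    usable = free + buffered + cached  # memory the OS could hand to other apps

    return {
        "system.mem.used": used,
        "system.mem.usable": usable,
        # Assumption: reported as a fraction of total, not 0-100.
        "system.mem.pct_usable": float(usable) / total,
    }


if __name__ == "__main__":
    metrics = derive_memory_metrics(total=8000, free=1000, buffered=500, cached=2500)
    for name in sorted(metrics):
        print(name, metrics[name])
```

With these made-up numbers, used is 7000 even though 4000 of the total is usable, which is why alerting on the usable percentage tends to be less noisy than alerting on used.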

If you're interested in seeing how we compute these memory metrics, this is a link to the code

Load

The system.load family of metrics is collected from the operating system and provides a high-level measure of how backed up the machine's CPUs are. The number roughly means how many processes were waiting for CPU time, averaged over the last N minutes, where N is the number in the metric name, i.e. system.load.5 covers the last 5 minutes.

A healthy system should have a load value of about the number of CPUs it has. That means the CPUs are well utilized without being overloaded. Since machines have many CPUs these days, a load of 4, for example, could be good or bad depending on how many CPUs that machine has. For convenience, we've created a derived metric family, system.load.norm, which is system.load divided by the number of CPUs on that machine. This value is useful for alerting, since values greater than 1 always mean that more processes want CPU time than there are CPUs to run them.
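The normalization is just a division, and can be sketched as follows. This is an illustrative example rather than the agent's implementation: normalized_load is a hypothetical helper, and the demo at the bottom assumes a Unix-like system where Python's os.getloadavg() is available.

```python
import multiprocessing
import os


def normalized_load(load_averages, cpu_count):
    """Divide each load average by the number of CPUs.

    Hypothetical helper for illustration only.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / float(cpu_count) for load in load_averages)


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages on Unix.
    if hasattr(os, "getloadavg"):
        raw = os.getloadavg()
        cpus = multiprocessing.cpu_count()
        print("raw load:", raw)
        print("normalized load:", normalized_load(raw, cpus))
```

For example, a 5 minute load of 4.0 normalizes to 1.0 on a 4 CPU machine but to 0.25 on a 16 CPU machine, so a single alert threshold works across both.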

If you're interested in seeing how we compute these load metrics, this is a link to the code
