A monitoring solution for Docker hosts and containers with: Prometheus, Grafana, cAdvisor, NodeExporter and alerting with: AlertManager, AlertManager-BOT-telegram
ADMIN_USER=admin ADMIN_PASSWORD=admin docker-compose up -d
docker-compose up -d --force-recreate --build
Prerequisites:
* Docker Engine >= 1.13
* Docker Compose >= 1.11
Containers:
* Prometheus (metrics database) `http://<host-ip>:9090`
* Prometheus-Pushgateway (push acceptor for ephemeral and batch jobs) `http://<host-ip>:9091`
* AlertManager (alerts management) `http://<host-ip>:9093`
* Grafana (visualize metrics) `http://<host-ip>:3000`
* NodeExporter (host metrics collector)
* cAdvisor (containers metrics collector)
* Caddy (reverse proxy and basic auth provider for prometheus and alertmanager)
## Setup Grafana
Navigate to `http://<host-ip>:3000` and login with user ***admin*** password ***admin***. You can change the credentials in the compose file or by supplying the `ADMIN_USER` and `ADMIN_PASSWORD` environment variables on compose up. The config file can be added directly in grafana part like this
grafana: image: grafana/grafana:5.2.4 env_file: - config
and the config file format should have this content
GF_SECURITY_ADMIN_USER=admin GF_SECURITY_ADMIN_PASSWORD=changeme GF_USERS_ALLOW_SIGN_UP=false
If you want to change the password, you have to remove this entry, otherwise the change will not take effect
- grafana_data:/var/lib/grafana
Grafana is preconfigured with dashboards and Prometheus as the default data source:
* Name: Prometheus
* Type: Prometheus
* Url: http://prometheus:9090
* Access: proxy
## Define alerts
Three alert groups have been setup within the [alert.rules](https://github.com/stefanprodan/dockprom/blob/master/prometheus/alert.rules) configuration file:
* Monitoring services alerts [targets](https://github.com/stefanprodan/dockprom/blob/master/prometheus/alert.rules#L2-L11)
* Docker Host alerts [host](https://github.com/stefanprodan/dockprom/blob/master/prometheus/alert.rules#L13-L40)
* Docker Containers alerts [containers](https://github.com/stefanprodan/dockprom/blob/master/prometheus/alert.rules#L42-L69)
You can modify the alert rules and reload them by making a HTTP POST call to Prometheus:
curl -X POST http://admin:admin@:9090/-/reload
***Monitoring services alerts***
Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) are down for more than 30 seconds:
```yaml
- alert: monitor_service_down
expr: up == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Monitor service non-operational"
description: "Service {{ $labels.instance }} is down."
Docker Host alerts
Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:
- alert: high_cpu_load
expr: node_load1 > 1.5
for: 30s
labels:
severity: warning
annotations:
summary: "Server under high load"
description: "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
Modify the load threshold based on your CPU cores.
Trigger an alert if the Docker host memory is almost full:
- alert: high_memory_load
expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) ) / sum(node_memory_MemTotal_bytes) * 100 > 85
for: 30s
labels:
severity: warning
annotations:
summary: "Server memory is almost full"
description: "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
Trigger an alert if the Docker host storage is almost full:
- alert: high_storage_load
expr: (node_filesystem_size_bytes{fstype="aufs"} - node_filesystem_free_bytes{fstype="aufs"}) / node_filesystem_size_bytes{fstype="aufs"} * 100 > 85
for: 30s
labels:
severity: warning
annotations:
summary: "Server storage is almost full"
description: "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}."
Docker Containers alerts
Trigger an alert if a container is down for more than 30 seconds:
- alert: jenkins_down
expr: absent(container_memory_usage_bytes{name="jenkins"})
for: 30s
labels:
severity: critical
annotations:
summary: "Jenkins down"
description: "Jenkins container is down for more than 30 seconds."
Trigger an alert if a container is using more than 10% of total CPU cores for more than 30 seconds:
- alert: jenkins_high_cpu
expr: sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu_seconds_total{mode="system"}) * 100 > 10
for: 30s
labels:
severity: warning
annotations:
summary: "Jenkins high CPU usage"
description: "Jenkins CPU usage is {{ humanize $value}}%."
Trigger an alert if a container is using more than 1.2GB of RAM for more than 30 seconds:
- alert: jenkins_high_memory
expr: sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
for: 30s
labels:
severity: warning
annotations:
summary: "Jenkins high memory usage"
description: "Jenkins memory consumption is at {{ humanize $value}}."
The AlertManager service is responsible for handling alerts sent by Prometheus server. AlertManager can send notifications via email, Pushover, Slack, HipChat or any other system that exposes a webhook interface. A complete list of integrations can be found here.
You can view and silence notifications by accessing http://<host-ip>:9093
.
The notification receivers can be configured in alertmanager/config.yml file.
To receive alerts via Slack you need to make a custom integration by choose incoming web hooks in your Slack team app page. You can find more details on setting up Slack integration here.
The pushgateway is used to collect data from batch jobs or from services.
To push data, simply execute:
echo "some_metric 3.14" | curl --data-binary @- http://user:password@localhost:9091/metrics/job/some_job
Please replace the user:password
part with your user and password set in the initial configuration (default: admin:admin
).