Inconsistent container metrics in prometheus route #1704
We're observing the same behavior, in the Kubernetes 1.7.0 kubelet (port 4194) and in the docker image for …
Versions: …
I ran cadvisor on Kubernetes using the following DaemonSet:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: default
  labels:
    app: "cadvisor"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: "cadvisor"
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '4194'
        prometheus.io/path: '/metrics'
    spec:
      containers:
      - name: "cadvisor"
        image: "google/cadvisor:v0.26.1"
        args:
        - "-port=4194"
        - "-logtostderr"
        livenessProbe:
          httpGet:
            path: /api
            port: 4194
        volumeMounts:
        - name: root
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: var-lib-docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
        resources:
          limits:
            cpu: 500.0m
            memory: 256Mi
          requests:
            cpu: 250.0m
            memory: 128Mi
      restartPolicy: Always
      volumes:
      - name: "root"
        hostPath:
          path: /
      - name: "var-run"
        hostPath:
          path: /var/run
      - name: "sys"
        hostPath:
          path: /sys
      - name: "var-lib-docker"
        hostPath:
          path: /var/lib/docker
      - name: "docker-socket"
        hostPath:
          path: /var/run/docker.sock
```
Running the binary without root permissions fixes the problem, but now container labels are missing. Using the …
@zeisss @micahhausler are you both running Prometheus 2.0? In 1.x versions the flapping metrics are not caught by the new staleness handling and thus it should have no immediately visible effect. In general, though, it's definitely wrong behavior by cAdvisor that violates the /metrics contract.
@fabxc I'm using Prometheus 1.5.2 and cAdvisor on the host machine, and I also have this problem. Worst of all, with this bug Prometheus sometimes loses metrics for some containers... In Grafana my graph of running containers looks like this: … And I see alerts from Alertmanager saying containers are down, but actually all containers are running the whole time.
We currently have a workaround by running cadvisor as an explicit (non-root) user. This is OK for us, as having the CPU and memory graphs is already a win. But AFAICT this mode is missing the Docker container labels as well as the network and disk I/O metrics.
@fabxc no, we are still running a 1.x Prometheus version - but having Prometheus work around this bug in cadvisor is not a good solution IMO.
We are currently in the process of updating our DEV cluster to Docker 17.06-ce, where we are still seeing this behavior if run as root (…
I have the same issue with Kubernetes 1.7.2 and 1.7.3.
I have the exact same problem as @dexterHD; it drives me crazy, my container-down alert spams me with false alerts all the time.
Having the same issue with Docker 17.06, Prometheus and Docker Swarm.
cc @grobie
/cc
Same thing: 0.26 and 0.26.1 are unusable with Prometheus (in our case 1.7.x).
@Hermain @roman-vynar According to the release notes, 0.26 includes "Bug: Fix prometheus metrics." Do we have an ETA on fixing this? No devs in this issue? And no assignee?
According to #1690 (comment) the fix in 0.26.1 isn't working / is incomplete; maybe this is the same problem?
Does this problem happen on a cAdvisor built from master, which includes #1679?
Thanks.
I hit the same problem using cadvisor 0.26.1 and Prometheus 1.7.1, but it's OK when I change cadvisor to v0.25.0, and it's also OK with cadvisor 0.26.1 and Prometheus 1.5.3. I'm a little confused; it seems to be a compatibility issue.
Seeing the same high-level symptoms: for me it's the labels that are missing, not the containers. And when the labels are missing I get a lot more lines for other cgroups. I'm running Kubernetes 1.7.3 on Ubuntu. Two examples from the same kubelet on the same machine, a few seconds apart:
Example 1: …
Example 2: …
Different metrics in the same scrape will be fine, e.g. …
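For anyone who wants to check their own endpoint for this symptom, here is a small, hypothetical diagnostic sketch (not from this thread): it parses a single scrape and lists the families whose samples do not all carry the same label keys. It assumes cAdvisor is reachable on localhost:8701, the port used in the original report below.

```go
// Hypothetical diagnostic: within one scrape, report metric families whose
// samples do not all carry the same set of label names.
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// labelKeys returns a canonical string of the label names on one sample.
func labelKeys(m *dto.Metric) string {
	keys := make([]string, 0, len(m.Label))
	for _, lp := range m.Label {
		keys = append(keys, lp.GetName())
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

func main() {
	resp, err := http.Get("http://localhost:8701/metrics") // assumed port
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}
	for name, mf := range families {
		seen := map[string]bool{}
		for _, m := range mf.Metric {
			seen[labelKeys(m)] = true
		}
		if len(seen) > 1 {
			// More than one distinct label set within a family is the
			// inconsistency Prometheus rejects.
			fmt.Printf("%s: %d distinct label sets\n", name, len(seen))
		}
	}
}
```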
I think I figured out what is going wrong. The function that builds the per-container Prometheus labels adds a different set of labels depending on the kind of container (Docker containers get labels that raw cgroups do not). However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not. Since containers are collected in (somewhat) random order, depending on which kind is seen first you get one set of metrics or the other. Changing the container labels function to always add the same set of labels, adding …
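As a rough illustration of that fix (a sketch, not cAdvisor's actual collector code; the metric name and label set are simplified for the example), a client_golang collector can declare one Desc carrying the full label set and emit empty strings for labels that do not apply to a given cgroup, so every sample in the family has the same label dimensions:

```go
// Minimal sketch of "always emit the same label set" with client_golang.
// Illustrative only; cAdvisor's real collector is more involved.
package main

import (
	"fmt"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

type fixedLabelCollector struct {
	cpu *prometheus.Desc
}

func (c fixedLabelCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.cpu }

func (c fixedLabelCollector) Collect(ch chan<- prometheus.Metric) {
	// A Docker container: all labels are known.
	ch <- prometheus.MustNewConstMetric(c.cpu, prometheus.CounterValue, 12.3,
		"/docker/f7ba91df74c8", "nginx", "nginx:1.13")
	// A raw cgroup (e.g. a systemd slice): name and image are unknown, so we
	// fill in empty strings instead of dropping the labels.
	ch <- prometheus.MustNewConstMetric(c.cpu, prometheus.CounterValue, 4.2,
		"/system.slice/sshd.service", "", "")
}

func main() {
	c := fixedLabelCollector{
		cpu: prometheus.NewDesc(
			"container_cpu_usage_seconds_total",
			"Cumulative CPU time consumed, in seconds.",
			[]string{"id", "name", "image"}, nil, // one fixed label set
		),
	}
	reg := prometheus.NewRegistry()
	reg.MustRegister(c)

	mfs, err := reg.Gather()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	enc := expfmt.NewEncoder(os.Stdout, expfmt.FmtText)
	for _, mf := range mfs {
		if err := enc.Encode(mf); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```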
Thanks @bboreham! Can you submit a PR with your fix? I will try and get this in the 1.8 release.
For those stuck on 0.25.0 because of this issue, I've cherry-picked (04fc089) the patch to kube-state-metrics mentioned above (#1704 (comment)) onto cadvisor's local copy of … NB: this is merely a workaround until a proper fix is available in a release!
Prometheus requires that all metrics in the same family have the same labels, so we arrange to supply blank strings for missing labels. See google/cadvisor#1704
We're observing the same behavior with version …
After several discussions I had with various people, I came to the conclusion we want to support "label filling" within the Prometheus Go client. You can track progress here: prometheus/client_golang#355
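Until such label filling exists in the client, a collector can normalize the label set itself before emitting. A tiny caller-side sketch (the label names here are an assumed example, not necessarily cAdvisor's exact set):

```go
// Hypothetical caller-side "label filling": missing labels become "".
package main

import "fmt"

// familyLabels is the full label set every sample in the family must carry.
var familyLabels = []string{"id", "name", "image"}

// fillLabels maps whatever labels a container actually has onto familyLabels,
// substituting "" for anything missing, so the label dimensions never vary.
func fillLabels(raw map[string]string) []string {
	values := make([]string, len(familyLabels))
	for i, key := range familyLabels {
		values[i] = raw[key] // a missing key yields "", which is the point
	}
	return values
}

func main() {
	docker := map[string]string{"id": "/docker/f7ba91df74c8", "name": "nginx", "image": "nginx:1.13"}
	slice := map[string]string{"id": "/system.slice/sshd.service"} // no name/image

	fmt.Println(fillLabels(docker)) // [/docker/f7ba91df74c8 nginx nginx:1.13]
	fmt.Println(fillLabels(slice))  // [/system.slice/sshd.service  ]
}
```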
I've looked into this, and there looks to be a simpler solution. I believe that using the approach from kubernetes/kubernetes#51473 in cAdvisor would be sufficient to resolve the issue here. That is in … Is there something I'm missing?
Ah, I see. It's the …
I've put together #1831, which I believe will fix this.
The fix is released in version v0.28.3.
Thank you all.
Our cadvisor reports different containers each time we query the /metrics route. The problem is consistent across various environments and VMs. I initially found #1635 and thought this to be the same, but the linked #1572 explains that cadvisor seems to pick up two systemd slices for the container, which is not the case according to my logs. Thus a separate issue, just to be sure.
cAdvisor listens on :8701 and is started as follows:
$ sudo /opt/cadvisor/bin/cadvisor -port 8701 -logtostderr -v=10
Neither dockerd nor cadvisor print any logs during those requests.
Startup Logs
Logs for a container
Example: I am missing the metrics for f7ba91df74c8. cAdvisor mentions the container ID only once: …
System
We are running an old docker swarm setup with consul, consul-template and nginx per host. No Kubernetes.