No metrics on authorization failure #55

Closed
leahoswald opened this issue Dec 22, 2020 · 14 comments

@leahoswald (Contributor)

I observed some random outages of our metrics for all Proxmox nodes during incidents that affected only a single node. The log shows an authorization failure triggered by a request in collector.py to fetch the VM metrics. So I think it happens when a node is reported as online but isn't actually reachable or has other problems. This should be handled with try/except logic so that all other metrics can still be collected correctly. I'll add a pull request for that.

Dec 21 00:00:14 node5 pve_exporter[476752]: Exception thrown while rendering view
Dec 21 00:00:14 node5 pve_exporter[476752]: Traceback (most recent call last):
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in view
Dec 21 00:00:14 node5 pve_exporter[476752]:     return self._views[endpoint](**params)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in on_pve
Dec 21 00:00:14 node5 pve_exporter[476752]:     output = collect_pve(self._config[module], target)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 301, in collect_pve
Dec 21 00:00:14 node5 pve_exporter[476752]:     return generate_latest(registry)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", line 90, in generate_latest
Dec 21 00:00:14 node5 pve_exporter[476752]:     for metric in registry.collect():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Dec 21 00:00:14 node5 pve_exporter[476752]:     for metric in collector.collect():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 272, in collect
Dec 21 00:00:14 node5 pve_exporter[476752]:     for vmdata in self._pve.nodes(node['node']).qemu.get():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 84, in get
Dec 21 00:00:14 node5 pve_exporter[476752]:     return self(args)._request("GET", params=params)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 79, in _request
Dec 21 00:00:14 node5 pve_exporter[476752]:     resp.content))
Dec 21 00:00:14 node5 pve_exporter[476752]: proxmoxer.core.ResourceException: 401 Unauthorized: b''
@znerol (Member) commented Dec 22, 2020

Thanks for the report. Can you tell whether the exception is crashing pve_exporter? I.e., are there any signs that subsequent scrapes are executed with another pid?

@leahoswald (Contributor, Author)

From what I can see there is no crash, only the exception and the fact that no metrics are rendered. But while looking into that I found another exception in the same part of the code:

Dec 21 10:39:26 node5 pve_exporter[476752]: Exception thrown while rendering view
Dec 21 10:39:26 node5 pve_exporter[476752]: Traceback (most recent call last):
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in 
Dec 21 10:39:26 node5 pve_exporter[476752]:     return self._views[endpoint](**params)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in o
Dec 21 10:39:26 node5 pve_exporter[476752]:     output = collect_pve(self._config[module], target)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 301
Dec 21 10:39:26 node5 pve_exporter[476752]:     return generate_latest(registry)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", li
Dec 21 10:39:26 node5 pve_exporter[476752]:     for metric in registry.collect():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line
Dec 21 10:39:26 node5 pve_exporter[476752]:     for metric in collector.collect():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 272
Dec 21 10:39:26 node5 pve_exporter[476752]:     for vmdata in self._pve.nodes(node['node']).qemu.get():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 84, in get
Dec 21 10:39:26 node5 pve_exporter[476752]:     return self(args)._request("GET", params=params)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 78, in _req
Dec 21 10:39:26 node5 pve_exporter[476752]:     raise ResourceException("{0} {1}: {2}".format(resp.status_code, httplib.respons
Dec 21 10:39:26 node5 pve_exporter[476752]: KeyError: 595

@znerol (Member) commented Dec 22, 2020

It seems to me that the proxmoxer library might be a bit out of date. At least after commit 4b2f7167 those status codes shouldn't result in such an exception. Would it be possible to update proxmoxer on the affected hosts?
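
For reference, the KeyError in the second traceback comes from looking up a non-standard HTTP status code in the standard library's status-phrase table, which the proxmoxer version in the traceback apparently uses when formatting its error message. A minimal illustration (not the actual proxmoxer code):

from http.client import responses  # Python 3's name for the "httplib" table referenced in the traceback

responses[401]  # -> 'Unauthorized': standard status codes resolve fine
responses[595]  # -> raises KeyError: 595, the non-standard code from the log has no entry in the table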

@leahoswald (Contributor, Author)

Well, you are right. I thought I had updated it a while ago, but somehow I didn't and was stuck on the minor version before that fix ^^
But if I read the mentioned commit right, it only fixes the second error, not the first one, right?

@znerol (Member) commented Dec 22, 2020

They also fixed some authentication-related issues in the last couple of releases, so it might help with the first one as well.

If it doesn't, then I'd need a bit more info on how the scraping is set up in your case. Namely:

  • Do you run pve_exporter on every Proxmox node?
  • Do you scrape every node? Or just a subset? Or even just one node in order to gather the metrics?
  • Do you have any relevant relabel_configs and/or metric_relabel_configs in place in the Prometheus configuration?

@leahoswald (Contributor, Author)

I think we'll have to wait and see if this happens again, then. That will take some time because the problem occurs only rarely, as the cluster is very stable most of the time. But I can still answer your questions:
Currently we run the exporter on every node, but only one node is scraped (I think I will change this in the future). And there is currently no relabeling config in place.

@znerol (Member) commented Dec 22, 2020

Ok, also see the comments in #54 for the relabeling thing. I need to document this in the wiki at some point.

@leahoswald (Contributor, Author)

Ok, moving the discussion from PR #56 over here. The problem is that if a host behind the node that pve_exporter gets its data from (the API node) fails, the API query used in the affected code can fail for that host. The effect is that the whole scrape fails for all nodes, even though all other nodes, including the API node, are still available. To prevent this we need to catch the exception; then we can still get API data for all the other nodes, and a failure on the problematic node no longer affects the other requests.

@znerol (Member) commented Feb 19, 2021

Thanks for the explanation, and I apologize that I was a bit slow on the uptake. I believe this is the same issue as in #30, but apparently I failed to fix that completely. I tried to address it in #31 by checking whether a node is available before trying to access the lxc/qemu configurations. But apparently this is not enough 🙄.

I suspect the following failure mode, which the previous fix did not take into account: if the status of a node changes from online to offline mid-loop, that breaks the whole scrape, as reported in #30 and as you observed as well. Considering that there are performance issues with that piece of code, as reported in #58, the chances for this kind of failure are certainly there.

I published v2.1.0 earlier today with new command line flags. It is now possible to run pve_exporter with the --no-collector.config flag, which deactivates the buggy collector. The only metric this collector contributes is pve_onboot_status, so its usefulness is limited anyway. Given all those weird issues with that piece of code, I am considering deactivating it by default in a future release.

That said, if you'd like to roll another PR, I'd be willing to include one if it replaces the incomplete fix from #31. I.e., just replace if node["status"] == "online": with try: and then catch the exception after the second loop (lxc). Instead of swallowing it, log it the same way as in http.py.
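
For illustration, a minimal sketch of that shape, assuming a simplified collector (the class name, label layout, and config lookup here are illustrative, not the actual collector.py code):

import logging

from prometheus_client.core import GaugeMetricFamily


class NodeConfigCollector:
    """Sketch: collect pve_onboot_status per guest without letting one bad node kill the scrape."""

    def __init__(self, pve):
        self._pve = pve  # proxmoxer.ProxmoxAPI instance
        self._log = logging.getLogger(__name__)

    def collect(self):
        metric = GaugeMetricFamily(
            'pve_onboot_status', 'Proxmox onboot value from the guest config',
            labels=['id', 'node'])

        for node in self._pve.nodes.get():
            name = node['node']
            # Previously: if node["status"] == "online": ...
            # A node can drop out mid-loop, so guard the per-node API calls instead.
            try:
                for vm in self._pve.nodes(name).qemu.get():
                    config = self._pve.nodes(name).qemu(vm['vmid']).config.get()
                    metric.add_metric(['qemu/{0}'.format(vm['vmid']), name],
                                      config.get('onboot', 0))
                for ct in self._pve.nodes(name).lxc.get():
                    config = self._pve.nodes(name).lxc(ct['vmid']).config.get()
                    metric.add_metric(['lxc/{0}'.format(ct['vmid']), name],
                                      config.get('onboot', 0))
            except Exception:  # pylint: disable=broad-except
                # Log it the way http.py logs view errors, instead of failing the whole scrape.
                self._log.exception('Failed to scrape qemu/lxc config from node %s', name)

        yield metric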

Thanks again for reporting the issue and also thanks for insisting.

@leahoswald (Contributor, Author)

I will see if I find the time to update my PR next week. But just to be sure: the issue I observe is not limited to a single scrape while the status changes; it can persist for hours.

@znerol (Member) commented Feb 20, 2021

it can persist for hours.

Interesting. Out of curiosity: when that happens, can you still use the PVE web interface on the API node to see/edit all the qemu/lxc guests on other nodes?

Also, when the scrape succeeds, how long does it take? The easiest way to check that (at least on Linux and macOS) is to scrape manually (e.g. using curl) and measure with the time tool: time curl http://pve-node.example.org:9221/pve

@znerol (Member) commented Mar 3, 2021

Opened #63, crediting @leahoswald for the fix. Do you think that works for you?

@leahoswald (Contributor, Author)

Looks great, thanks!

@znerol (Member) commented Mar 5, 2021

Rolled this into 2.1.1. Thanks for the report, and sorry again that it took me so long to understand the problem.
