No metrics on authorization failure #55

Closed
leahoswald opened this issue Dec 22, 2020 · 14 comments

@leahoswald (Contributor)

I observed some random outages of our metrics for all Proxmox nodes during incidents that affected only a single node. The log shows an authorization failure triggered by a request in collector.py to fetch the VM metrics. So I think it happens when a node is reported as online but isn't actually reachable or has other problems. This should be handled with try/except logic so that all other metrics can still be collected correctly. I'll add a pull request for that.

Dec 21 00:00:14 node5 pve_exporter[476752]: Exception thrown while rendering view
Dec 21 00:00:14 node5 pve_exporter[476752]: Traceback (most recent call last):
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in view
Dec 21 00:00:14 node5 pve_exporter[476752]:     return self._views[endpoint](**params)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in on_pve
Dec 21 00:00:14 node5 pve_exporter[476752]:     output = collect_pve(self._config[module], target)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 301, in collect_pve
Dec 21 00:00:14 node5 pve_exporter[476752]:     return generate_latest(registry)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", line 90, in generate_latest
Dec 21 00:00:14 node5 pve_exporter[476752]:     for metric in registry.collect():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line 75, in collect
Dec 21 00:00:14 node5 pve_exporter[476752]:     for metric in collector.collect():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 272, in collect
Dec 21 00:00:14 node5 pve_exporter[476752]:     for vmdata in self._pve.nodes(node['node']).qemu.get():
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 84, in get
Dec 21 00:00:14 node5 pve_exporter[476752]:     return self(args)._request("GET", params=params)
Dec 21 00:00:14 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 79, in _request
Dec 21 00:00:14 node5 pve_exporter[476752]:     resp.content))
Dec 21 00:00:14 node5 pve_exporter[476752]: proxmoxer.core.ResourceException: 401 Unauthorized: b''
@znerol (Member) commented Dec 22, 2020

Thanks for the report. Can you tell whether the exception is crashing pve_exporter? I.e., are there any signs that subsequent scrapes are executed with another pid?

@leahoswald (Contributor, Author)

From what I can see there is no crash, only the exception and the fact that no metrics are rendered. But while looking into that I found another exception in the same part of the code:

Dec 21 10:39:26 node5 pve_exporter[476752]: Exception thrown while rendering view
Dec 21 10:39:26 node5 pve_exporter[476752]: Traceback (most recent call last):
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 101, in 
Dec 21 10:39:26 node5 pve_exporter[476752]:     return self._views[endpoint](**params)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/http.py", line 53, in o
Dec 21 10:39:26 node5 pve_exporter[476752]:     output = collect_pve(self._config[module], target)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 301
Dec 21 10:39:26 node5 pve_exporter[476752]:     return generate_latest(registry)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/exposition.py", li
Dec 21 10:39:26 node5 pve_exporter[476752]:     for metric in registry.collect():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/prometheus_client/registry.py", line
Dec 21 10:39:26 node5 pve_exporter[476752]:     for metric in collector.collect():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/pve_exporter/collector.py", line 272
Dec 21 10:39:26 node5 pve_exporter[476752]:     for vmdata in self._pve.nodes(node['node']).qemu.get():
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 84, in get
Dec 21 10:39:26 node5 pve_exporter[476752]:     return self(args)._request("GET", params=params)
Dec 21 10:39:26 node5 pve_exporter[476752]:   File "/usr/local/lib/python3.7/dist-packages/proxmoxer/core.py", line 78, in _req
Dec 21 10:39:26 node5 pve_exporter[476752]:     raise ResourceException("{0} {1}: {2}".format(resp.status_code, httplib.respons
Dec 21 10:39:26 node5 pve_exporter[476752]: KeyError: 595

@znerol (Member) commented Dec 22, 2020

It seems to me that the proxmoxer library might be a bit out of date. At least after commit 4b2f7167 those status codes shouldn't result in such an exception. Would it be possible to update proxmoxer on the affected hosts?
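
For reference, the KeyError in the second traceback comes from looking up a non-standard HTTP status code in the standard library's status-phrase table, which the proxmoxer version in the traceback apparently uses when formatting its error message. A minimal illustration (not the actual proxmoxer code):

from http.client import responses  # Python 3's name for the "httplib" table referenced in the traceback

responses[401]  # -> 'Unauthorized': standard status codes resolve fine
responses[595]  # -> raises KeyError: 595, the non-standard code from the log has no entry in the table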

@leahoswald (Contributor, Author)

Well, you are right. I thought I had updated it a while ago, but somehow I didn't and was stuck on the minor version before that fix ^^
But if I read the mentioned commit right, it only fixes the second error, not the first one, right?

@znerol (Member) commented Dec 22, 2020

They also fixed some authentication-related issues in the last couple of releases, so it might help with the first one as well.

If it doesn't, then I'd need a bit more info on how the scraping is set up in your case. Namely:

  • Do you run pve_exporter on every Proxmox node?
  • Do you scrape every node? Or just a subset? Or even just one node in order to gather the metrics?
  • Do you have any relevant relabel_configs and/or metric_relabel_configs in place in the Prometheus configuration?

@leahoswald (Contributor, Author)

I think we'll have to wait and see if this happens again, then. That will take some time because the problem occurs only rarely, as the cluster is very stable most of the time. But I can still answer your questions:
Currently we run the exporter on every node, but only one node is scraped (I think I will change this in the future). And there is currently no relabeling config in place.

@znerol (Member) commented Dec 22, 2020

Ok, also see the comments in #54 for the relabeling thing. I need to document this in the wiki at some point.

@leahoswald (Contributor, Author)

Ok, moving the discussion from PR #56 over here. The problem is that if a host behind the node that pve_exporter gets its data from (the API node) fails, the API query used in the affected code can fail for that host. The effect is that the whole scrape fails for all nodes, even though all other nodes, including the API node, are still available. To prevent this we need to catch the exception; then we can still get API data for all the other nodes, and a failure on the problematic node no longer affects the other requests.

@znerol (Member) commented Feb 19, 2021

Thanks for the explanation, and I apologize that I was a bit slow on the uptake. I believe this is the same issue as in #30, but apparently I failed to fix that completely. I tried to address it in #31 by checking whether a node is available before trying to access the lxc/qemu configurations. But apparently this is not enough 🙄.

I suspect the following failure mode, which the previous fix did not take into account: if the status of a node changes from online to offline mid-loop, that breaks the whole scrape, as reported in #30 and as you observed as well. Considering that there are performance issues with that piece of code, as reported in #58, the chances for this kind of failure are certainly there.

I published v2.1.0 earlier today with new command line flags. It is now possible to run pve_exporter with the --no-collector.config flag, which deactivates the buggy collector. The only metric this collector contributes is pve_onboot_status, so its usefulness is limited anyway. Given all those weird issues with that piece of code, I am considering deactivating it by default in a future release.

That said, if you'd like to roll another PR, I'd be willing to include one if it replaces the incomplete fix from #31. I.e., just replace if node["status"] == "online": with try: and then catch the exception after the second loop (lxc). Instead of swallowing it, log it the same way as in http.py.
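
For illustration, a minimal sketch of that shape, assuming a simplified collector (the class name, label layout, and config lookup here are illustrative, not the actual collector.py code):

import logging

from prometheus_client.core import GaugeMetricFamily


class NodeConfigCollector:
    """Sketch: collect pve_onboot_status per guest without letting one bad node kill the scrape."""

    def __init__(self, pve):
        self._pve = pve  # proxmoxer.ProxmoxAPI instance
        self._log = logging.getLogger(__name__)

    def collect(self):
        metric = GaugeMetricFamily(
            'pve_onboot_status', 'Proxmox onboot value from the guest config',
            labels=['id', 'node'])

        for node in self._pve.nodes.get():
            name = node['node']
            # Previously: if node["status"] == "online": ...
            # A node can drop out mid-loop, so guard the per-node API calls instead.
            try:
                for vm in self._pve.nodes(name).qemu.get():
                    config = self._pve.nodes(name).qemu(vm['vmid']).config.get()
                    metric.add_metric(['qemu/{0}'.format(vm['vmid']), name],
                                      config.get('onboot', 0))
                for ct in self._pve.nodes(name).lxc.get():
                    config = self._pve.nodes(name).lxc(ct['vmid']).config.get()
                    metric.add_metric(['lxc/{0}'.format(ct['vmid']), name],
                                      config.get('onboot', 0))
            except Exception:  # pylint: disable=broad-except
                # Log it the way http.py logs view errors, instead of failing the whole scrape.
                self._log.exception('Failed to scrape qemu/lxc config from node %s', name)

        yield metric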

Thanks again for reporting the issue and also thanks for insisting.

@leahoswald (Contributor, Author)

I will see if I find the time to update my PR next week. But just to be sure: the issue I observe is not limited to a single scrape while the status changes; it can persist for hours.

@znerol (Member) commented Feb 20, 2021

it can persist for hours.

Interesting. Out of curiosity: when that happens, can you still use the PVE web interface on the API node to see/edit all the qemu/lxc guests on other nodes?

Also, when the scrape succeeds, how long does it take? The easiest way to check that (at least on Linux and macOS) is to scrape manually (e.g. using curl) and measure with the time tool: time curl http://pve-node.example.org:9221/pve

@znerol (Member) commented Mar 3, 2021

Opened #63, crediting @leahoswald for the fix. Do you think that works for you?

@leahoswald (Contributor, Author)

Looks great, thanks!

@znerol (Member) commented Mar 5, 2021

Rolled this into 2.1.1. Thanks for the report, and sorry again that it took me so long to understand the problem.
