No metrics on authorization failure #55
Comments
Thanks for the report. Can you tell whether the exception is crashing the exporter?
From what I can see there is no crash, only the exception and the fact that no metrics are rendered. But while looking into that I found another exception in the same part of the code.
It seems to me that the second exception was already fixed in a more recent release of the underlying API library, so updating should take care of it.
Well, you are right. I thought I had updated it a while ago, but somehow I didn't and was stuck on the minor version just before that fix ^^
They also fixed some authentication-related issues in the last couple of releases, so updating might help with the first exception as well. If it doesn't, I'd need a bit more info on how the scraping is set up in your case. Namely:
I think we have to wait then until this happens again. That will take some time, because the problem occurs only rarely, as the cluster is very stable most of the time. But I can still answer your questions.
Ok, also see the comments in #54 for the relabeling thing. I need to document this in the wiki at some point.
Ok... moving the discussion from PR #56 over here. The problem is that if a host behind the API node (the node the pve-exporter gets its data from) fails, the API query used in the affected code can fail for that host. The effect is that the whole scrape fails for all nodes, even though all other nodes, including the API node, are still available. To prevent this we need to catch the exception; then we can still get API data for all other nodes, and a failure of the problematic node no longer affects the other requests.
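For illustration, here is a minimal sketch of that pattern. This is not the project's actual collector.py code; it assumes a proxmoxer-style client, and all names are illustrative only.

```python
# Minimal sketch, not the project's actual collector code.
# Assumes a proxmoxer-style client; names are illustrative only.
from proxmoxer import ProxmoxAPI

def collect_guest_configs(pve: ProxmoxAPI):
    configs = []
    for node in pve.nodes.get():
        name = node['node']
        try:
            # Any of these calls can fail when the node is unreachable,
            # even if its reported status is still 'online'.
            for vm in pve.nodes(name).qemu.get():
                configs.append(pve.nodes(name).qemu(vm['vmid']).config.get())
            for ct in pve.nodes(name).lxc.get():
                configs.append(pve.nodes(name).lxc(ct['vmid']).config.get())
        except Exception:
            # In real code, narrow this to the API client's error type.
            # Skipping only this node keeps metrics for all other nodes intact.
            continue
    return configs
```

With this shape, one unreachable host only loses its own guest metrics instead of aborting the whole scrape.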
Thanks for the explanation, and I apologize that I was a bit slow on the uptake. I believe this is the same issue as in #30, which I apparently failed to fix completely. I tried to address it in #31 by checking whether a node is available before trying to access the lxc/qemu configurations, but apparently that is not enough 🙄. I suspect the following failure mode, which the previous fix did not take into account: if the status of a node changes from online to offline mid-loop, that breaks the whole scrape, as reported in #30 and as you observed as well. Considering that there are performance issues with that piece of code, as reported in #58, the chances for this kind of failure are certainly there. I've published […].

That said, if you'd like to roll another PR, I'd be willing to include one if it replaces the incomplete fix from #31, i.e., just replace the online check with proper exception handling. Thanks again for reporting the issue, and also thanks for insisting.
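As a hypothetical before/after of that change (not a literal diff of #31 or of the eventual fix), the idea is to drop the status guard and rely on exception handling instead, since the reported status can change while the loop is still running:

```python
# Hypothetical sketch of the proposed change; not a literal diff.

def qemu_guests(pve, node):
    # Before (roughly the approach from #31): trust the reported status.
    # if node['status'] != 'online':
    #     return []
    # return pve.nodes(node['node']).qemu.get()

    # After: attempt the query and treat any failure as "no data for this
    # node", which also covers a node going offline mid-loop.
    try:
        return pve.nodes(node['node']).qemu.get()
    except Exception:  # narrow to the client's error type in real code
        return []
```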
I will see if I find the time to update my PR next week. But just to be sure: the issue I observe is not limited to a single scrape while the status changes; it can persist for hours.
Interesting. Out of curiosity: when that happens, can you still use the PVE web interface on the API node to see/edit all the qemu/lxc guests on the other nodes? Also, when the scrape succeeds, how long does it take? The easiest way to check that (at least on Linux and Mac) is to scrape the exporter manually and time the request.
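For reference, one rough way to time a manual scrape without extra tooling, assuming the exporter's defaults (port 9221 and the /pve endpoint with a target parameter); adjust the URL to your deployment:

```python
# Rough timing of a manual scrape; the URL and defaults are assumptions,
# adjust them to match your deployment.
import time
import urllib.request

url = "http://localhost:9221/pve?target=your-api-node.example.com"
start = time.monotonic()
with urllib.request.urlopen(url, timeout=300) as resp:
    body = resp.read()
print(f"scrape took {time.monotonic() - start:.1f}s, {len(body)} bytes of metrics")
```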
Opened #63 crediting @leahoswald for the fix. Do you think that works for you?
Looks great, thanks!
Rolled this into 2.1.1. Thanks for the report, and sorry again that it took me so long to understand the problem.
I observed some random outages of our metrics for all Proxmox nodes during incidents that affected only a single node. The log shows an authorization failure triggered by a request made in collector.py to get the VM metrics. So I think it happens when a node is reported as online but isn't actually reachable or has other problems. This should be handled with try/except logic so that all other metrics are still collected correctly. I'll add a pull request for that.