metrics suggestion: backup jobs, replication jobs #112

Open
steveej opened this issue May 3, 2022 · 7 comments

@steveej

steveej commented May 3, 2022

hey @znerol, thank you for creating this helpful exporter 🙌

i'd like to track and set up alerts for failed or absent backups and replications, and for high IO delay (the one that's displayed in the webui for each node).

cheers 👋

@znerol
Member

znerol commented May 3, 2022

This exporter is using the PVE REST API. Looking through the API docs I have found the following interesting routes possibly covering your requirements (at least partly):

absent backups:
/cluster/backup-info/not-backed-up lists all guests (qemu and lxc) which are not covered by any backup plan.
failed backups:
Maybe this is extractable from /cluster/backup.
failed replications:
Maybe this is extractable from /cluster/replication.

Regarding high IO delay, I recommend taking a look at node_exporter. For node-level metrics, this is usually the better option.
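
As a rough illustration of the "absent backups" route mentioned above, the following sketch turns an already-decoded response from /cluster/backup-info/not-backed-up into Prometheus text exposition lines. It assumes each record carries `vmid`, `type` and an optional `name`; the metric name `example_pve_guest_not_backed_up` and the function names are hypothetical and are not taken from this exporter.

```python
from typing import Iterable, Mapping


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_not_backed_up(guests: Iterable[Mapping]) -> str:
    """Render one gauge sample per guest that has no backup job.

    `guests` is assumed to be the decoded "data" list of the API response,
    e.g. [{"vmid": 100, "name": "web", "type": "qemu"}].
    """
    lines = [
        # Hypothetical metric name, chosen for this sketch only.
        "# HELP example_pve_guest_not_backed_up Guest is not covered by any backup job.",
        "# TYPE example_pve_guest_not_backed_up gauge",
    ]
    # Sort by numeric vmid so the output is stable between scrapes.
    for guest in sorted(guests, key=lambda g: int(g["vmid"])):
        guest_id = "{}/{}".format(guest["type"], guest["vmid"])
        name = escape_label(str(guest.get("name", "")))
        lines.append(
            'example_pve_guest_not_backed_up{{id="{}",name="{}"}} 1'.format(guest_id, name)
        )
    return "\n".join(lines) + "\n"
```

An alert on absent backups could then fire whenever any sample of such a metric exists. Detecting backup jobs that were scheduled but did not run would need a different data source.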

@steveej
Author

steveej commented May 4, 2022

thanks @znerol

/cluster/backup-info/not-backed-up lists all guests (qemu and lxc) which are not covered by any backup plan.

while i originally meant backup jobs that for some reason didn't execute, i also like the idea of alerting when a VM doesn't have a backup job at all.

for the rest i'll also have a look at the API to see which items would be useful to add.

Regarding high IO delay, I recommend taking a look at node_exporter. For node-level metrics, this is usually the better option.

indeed, thanks! i thought PVE was doing something special but according to the frontend code it evaluates the system's wait load, which can be gathered otherwise.
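
To illustrate how such a wait figure can be gathered without PVE, here is a minimal sketch that derives the share of CPU time spent in iowait from two snapshots of the aggregate `cpu` line in /proc/stat. It assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); whether this matches exactly what the PVE web UI displays is an assumption, and the function names are made up for this example.

```python
from typing import List


def parse_cpu_times(proc_stat_text: str) -> List[int]:
    """Return the first eight counters of the aggregate 'cpu' line."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:9]]
            if len(values) < 5:
                raise ValueError("'cpu' line has no iowait field")
            return values
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent waiting for I/O between two snapshots."""
    first = parse_cpu_times(before)
    second = parse_cpu_times(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        # No time elapsed (or counters reset): report no wait.
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter
```

In practice the two snapshots would be read from /proc/stat a few seconds apart. With node_exporter the same quantity is usually derived from the rate of `node_cpu_seconds_total{mode="iowait"}`.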

@xziy

xziy commented Oct 1, 2023

Hello everyone, is there any progress? I am facing a similar problem: I need to know which machines were left without a backup, or whose backup ended with an error.

@StarkZarn

IO wait would be a very useful metric to have, IMO, if possible -- especially for those using ZFS for backing storage.

@znerol
Member

znerol commented Feb 20, 2024

IO wait would be a very useful metric to have, IMO, if possible -- especially for those using ZFS for backing storage.

Please use node_exporter for the iowait metric. Take a look at this blog post for a start.

znerol changed the title from "metrics suggestion: backup jobs, replication jobs, and IO delay" to "metrics suggestion: backup jobs, replication jobs" on Feb 20, 2024
@StarkZarn

IO wait would be a very useful metric to have, IMO, if possible -- especially for those using ZFS for backing storage.

Please use node_exporter for the iowait metric. Take a look at this blog post for a start.

Thank you!

znerol added a commit that referenced this issue Apr 27, 2024
Add replication metrics as requested in issue #112.

* Replication Metrics are fetched per node
* The metrics can be enabled or disabled

Based on the original PR #166, adapted to the new file structure.

---------

Signed-off-by: Sven Gerber <[email protected]>
Co-authored-by: znerol <[email protected]>
Co-authored-by: Marian Koreniuk <[email protected]>
@znerol
Member

znerol commented Apr 27, 2024

Thanks to @svengerber and @themoriarti, replication metrics are available as of release v3.3.0.
