Add metrics to identify audit logging failures #2863
Comments
I just had this issue in production this morning. We had a failure in our audit backend (the file was rotated, but the -HUP signal did not happen for some reason). Vault log:
2017/06/19 05:41:59.431861 [ERROR] audit: backend failed to log response: backend=file/ error=write vault_audit.log: bad file descriptor
Yet the health checks continued to pass, despite 500 errors on every read/write call to Vault. This seems... wrong somehow :) Ideally, I'd love it if Vault would just try a re-open on a bad FD, as if a HUP signal had happened, especially since audit is crucial to a validly operating Vault system. Also, the /health check should probably at least WARN, if not outright FAIL, if it can't write the audit log for whatever reason.
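A minimal sketch of the re-open-on-write-failure behavior suggested above, assuming a plain file-backed writer; the type and function names here are hypothetical, and this is not Vault's actual file audit backend:

```go
// Hypothetical sketch of the suggested behavior: if a write to the audit log
// fails (e.g. "bad file descriptor" after the file was rotated), re-open the
// file and retry once, much as a SIGHUP handler would. Not Vault's actual code.
package audit

import (
	"os"
	"sync"
)

type fileAuditWriter struct {
	mu   sync.Mutex
	path string
	f    *os.File
}

func (w *fileAuditWriter) open() error {
	f, err := os.OpenFile(w.path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		return err
	}
	w.f = f
	return nil
}

// Write appends an audit entry; on failure it re-opens the file and retries once.
func (w *fileAuditWriter) Write(entry []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.f != nil {
		if _, err := w.f.Write(entry); err == nil {
			return nil
		}
		// The write failed; close the stale handle and re-open before retrying.
		w.f.Close()
	}
	if err := w.open(); err != nil {
		return err
	}
	_, err := w.f.Write(entry)
	return err
}
```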
Addresses a pain point from #2863 (comment)
@edjackson-wf Any chance you can tell me whether https://github.com/hashicorp/vault/pull/3001/files meets your needs? I figured that an incrementing counter is probably the right way to go.
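For reference, a minimal sketch of what an incrementing failure counter can look like with the armon/go-metrics library Vault uses for telemetry; the metric name and wrapper function below are illustrative, not necessarily what the linked PR adds:

```go
// Minimal sketch, assuming armon/go-metrics; the metric name is illustrative.
package audit

import metrics "github.com/armon/go-metrics"

// logResponse wraps an audit backend call and bumps a counter on failure so
// operators can alert on any sustained growth of the metric.
func logResponse(tryLog func() error) error {
	if err := tryLog(); err != nil {
		metrics.IncrCounter([]string{"audit", "log_response_failure"}, 1)
		return err
	}
	return nil
}
```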
@jefferai Yes, I think that would work for me. I suppose it might be worth considering the case where multiple audit backends are enabled and one fails. Is it worth distinguishing between audit failures that cause requests to fail and those that don't? It's not my use case, but it seems plausible.
@edjackson-wf We can add more specific metrics later if needed, but I'd argue that any time that counter is continually going up, there's a bad situation just waiting to happen, regardless of which backend is currently experiencing the problem. At that point the logs will tell the rest.
@jefferai Fair enough. Thanks a bunch for adding this.
No problem!
It would be very helpful to be able to alert an operator when there are audit logging failures. Because this information isn't available from the /sys/health endpoint, we need some other means.
I would suggest adding appropriate metrics to the telemetry so that alerting can be done from statsd/statsite.
I don't have strong feelings about exactly what the metrics should be. Being able to monitor 500 response codes would certainly help, or maybe the audit logging backends should provide more specific error metrics.
See also this conversation.
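As a rough illustration of the statsd/statsite suggestion above, here is a minimal sketch, assuming armon/go-metrics, of routing emitted metrics (including an audit-failure counter like the one discussed in the comments) to a statsd sink; the address and metric name are example values:

```go
// Minimal sketch, assuming armon/go-metrics; address and metric name are examples.
package main

import (
	"log"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// Forward all metrics to a local statsd/statsite agent.
	sink, err := metrics.NewStatsdSink("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	// "vault" becomes the metric prefix, e.g. vault.audit.log_response_failure.
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), sink); err != nil {
		log.Fatal(err)
	}

	// Any counter incremented after this point is emitted to statsd, where an
	// external alerting system can watch for sustained increases.
	metrics.IncrCounter([]string{"audit", "log_response_failure"}, 1)
}
```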