-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(Monitoring): Adding Monitoring for Disk Space and Number of Backups #7404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)
x/disk_metrics_others.go, line 12 at r1 (raw file):
"File System Metrics are not currently supported on non-linux platforms"
File system metrics are not currently supported on non-Linux platforms
Suggestion: change the capitalization.
x/metrics.go, line 66 at r1 (raw file):
// NumBackupsSuccess is the number of backups successfully completed NumBackupsSuccess = stats.Int64("num_backups_success", "Total number of backups completed", stats.UnitDimensionless)
We should add a metric for backup failures too so we can set up alerts for when backups fail. It can happen for whatever reason (context timeouts, EOF errors). Backup failures can't be caught on the client-side if the backup request takes too long, causing the client request to timeout while the backup is still pending on the Dgraph side.
x/metrics.go, line 440 at r1 (raw file):
// Exposing metrics at /metrics, which is the usual standard, as well as at the old endpoint http.Handle("/metrics", pe) http.Handle("/debug/prometheus_metrics", pe)
Should /debug/prometheus_metrics
be a 301 redirect to /metrics
? Would a redirect still work for existing Prometheus configs that refer to /debug/prometheus_metrics
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)
x/metrics.go, line 66 at r1 (raw file):
"Total number of backups requested", stats.UnitDimensionless) // NumBackupsSuccess is the number of backups successfully completed NumBackupsSuccess = stats.Int64("num_backups_success",
The suffix for this metric should be _total
to follow Prometheus's naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @gja, @manishrjain, and @vvbalaji-dgraph)
dgraph/cmd/zero/run.go, line 346 at r1 (raw file):
zero_wal_dir
We can call this wal
. It's already known to be Zero's wal directory since this metric is coming from Zero.
worker/server_state.go, line 136 at r1 (raw file):
// Commenting this out because Badger is doing its own cache checks. go x.MonitorCacheHealth(s.Pstore, s.gcCloser) go x.MonitorDiskMetrics("posting_dir", Config.PostingDir, s.gcCloser)
posting_dir
can be renamed to postings
. This follows the long-form flag name for setting the p directory ("postings").
x/metrics.go, line 96 at r1 (raw file):
disk_free
This should be called disk_free_bytes
x/metrics.go, line 99 at r1 (raw file):
disk_used
This should be called disk_used_bytes
x/metrics.go, line 102 at r1 (raw file):
disk_total
This should be called disk_bytes_total
x/metrics.go, line 163 at r1 (raw file):
dir_type
This can be dir
. This is the same naming scheme used in Badger metrics. e.g., https://github.com/dgraph-io/dgraph/blob/d79d3797bf90d9bb5a57b434a741302962ea5d84/x/metrics.go#L424-L428
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defer to @danielmai for final approval.
Reviewed 6 of 6 files at r2.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @danielmai, @gja, and @vvbalaji-dgraph)
worker/server_state.go, line 136 at r1 (raw file):
Previously, danielmai (Daniel Mai) wrote…
posting_dir
can be renamed topostings
. This follows the long-form flag name for setting the p directory ("postings").
postings-fs
so it's clear it's at a filesystem level.
This change is