Detect deadlock in dbengine page cache (netdata#6911)
* Detect deadlock in dbengine page cache when there are too many metrics and print error message
* Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms
* Changed printing deadlock errors to only happen once per dbengine instance
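As a rough illustration of the behaviour the commit message describes (this is not the dbengine C code), the toy model below assumes that every metric must hold one page of a fixed-size cache while it is being written. When all pages are held by other metrics, waiting for a free page would never end, so the sketch drops the data point, counts it, and records the error message only once per cache instance. `ToyPageCache`, `store`, `dropped_points`, and the message text are hypothetical names chosen for the example.

```python
class ToyPageCache:
    """Toy model of a page cache that is too small for the number of metrics."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.pinned = set()           # metrics currently holding a page
        self.dropped_points = 0       # would feed a "Page-Cache errors" style counter
        self.logged_deadlock = False  # report the problem once per instance
        self.log = []

    def store(self, metric):
        """Try to store a data point; return False if it had to be dropped."""
        if metric in self.pinned:
            return True               # metric already owns a page
        if len(self.pinned) < self.max_pages:
            self.pinned.add(metric)   # a free page is available
            return True
        # Every page is held by another metric: blocking here would deadlock,
        # so the data point is dropped instead (assumption of this sketch).
        self.dropped_points += 1
        if not self.logged_deadlock:
            self.logged_deadlock = True
            self.log.append("page cache too small for the number of metrics")
        return False


cache = ToyPageCache(max_pages=2)
results = [cache.store(m) for m in ("cpu", "ram", "disk", "net")]
print(results, cache.dropped_points, len(cache.log))
```

The once-per-instance flag keeps the log readable, while the counter still records every dropped point so that an alarm can react to it.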
1 parent 795b8ba, commit c8976d7
Showing 7 changed files with 97 additions and 32 deletions.
@@ -1,26 +1,51 @@

 # you can disable an alarm notification by setting the 'to' line to: silent

-alarm: 10min_dbengine_global_fs_errors
-on: netdata.dbengine_global_errors
-os: linux freebsd macos
-hosts: *
-lookup: sum -10m unaligned of FS errors
-units: errors
-every: 10s
-crit: $this > 0
-delay: down 15m multiplier 1.5 max 1h
-info: number of File-System errors dbengine came across the last 10 minutes (too many open files, wrong permissions etc)
-to: sysadmin
+alarm: 10min_dbengine_global_fs_errors
+on: netdata.dbengine_global_errors
+os: linux freebsd macos
+hosts: *
+lookup: sum -10m unaligned of FS errors
+units: errors
+every: 10s
+crit: $this > 0
+delay: down 15m multiplier 1.5 max 1h
+info: number of File-System errors dbengine came across the last 10 minutes (too many open files, wrong permissions etc)
+to: sysadmin

-alarm: 10min_dbengine_global_io_errors
-on: netdata.dbengine_global_errors
-os: linux freebsd macos
-hosts: *
-lookup: sum -10m unaligned of I/O errors
-units: errors
-every: 10s
-crit: $this > 0
-delay: down 1h multiplier 1.5 max 3h
-info: number of IO errors dbengine came across the last 10 minutes (CRC errors, out of space, bad disk etc)
-to: sysadmin
+alarm: 10min_dbengine_global_io_errors
+on: netdata.dbengine_global_errors
+os: linux freebsd macos
+hosts: *
+lookup: sum -10m unaligned of I/O errors
+units: errors
+every: 10s
+crit: $this > 0
+delay: down 1h multiplier 1.5 max 3h
+info: number of IO errors dbengine came across the last 10 minutes (CRC errors, out of space, bad disk etc)
+to: sysadmin
+
+alarm: 10min_dbengine_global_page_cache_errors
+on: netdata.dbengine_global_errors
+os: linux freebsd macos
+hosts: *
+units: errors
+every: 10s
+lookup: sum -10m unaligned of Page-Cache errors
+crit: $this > 0
+repeat: warning 120s critical 10s
+delay: down 1h multiplier 1.5 max 3h
+info: number of deadlocks dbengine resolved the last 10 minutes due to insufficient page cache size, metrics have been lost
+to: sysadmin
+
+alarm: 10min_dbengine_global_page_cache_warnings
+on: netdata.dbengine_global_errors
+os: linux freebsd macos
+hosts: *
+units: errors
+every: 10s
+lookup: sum -10m unaligned of Page-Cache warnings
+warn: $this > 0
+delay: down 1h multiplier 1.5 max 3h
+info: number of times dbengine almost deadlocked the last 10 minutes due to insufficient page cache size
+to: sysadmin
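To make the alarm semantics above concrete: each `lookup: sum -10m unaligned of ...` line sums one dimension of the chart over the last 10 minutes, and `crit: $this > 0` or `warn: $this > 0` raises the alarm on any nonzero sum. The sketch below models only that evaluation, assuming samples arrive as (timestamp in seconds, value) pairs. `WindowedSum` and `status` are illustrative names, not netdata APIs, and the `delay` and `repeat` lines are not modelled.

```python
from collections import deque


class WindowedSum:
    """Sum of the samples seen during the last `window_s` seconds."""

    def __init__(self, window_s=600):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value), oldest first

    def add(self, ts, value):
        self.samples.append((ts, value))

    def lookup(self, now):
        # Discard samples that have fallen out of the window, then sum the rest.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()
        return sum(value for _, value in self.samples)


def status(this, warn=None, crit=None):
    """Map a looked-up value to an alarm status; critical takes precedence."""
    if crit is not None and crit(this):
        return "CRITICAL"
    if warn is not None and warn(this):
        return "WARNING"
    return "CLEAR"


page_cache_errors = WindowedSum(window_s=600)
page_cache_errors.add(ts=0, value=1)  # one resolved deadlock at t=0
print(status(page_cache_errors.lookup(now=10), crit=lambda v: v > 0))
print(status(page_cache_errors.lookup(now=700), crit=lambda v: v > 0))
```

Under this model a single page cache error keeps the errors alarm critical until the sample leaves the 10-minute window, while the warnings alarm follows the same logic with `warn` in place of `crit`.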