Detect deadlock in dbengine page cache #6911
…s and print error message
Manage this branch in Squash. Test this branch here: https://mfunduldetect-and-log-deadlock-5z37v.squash.io
I know that this is what the user asked for, but with this change it looks like we'll leave netdata hanging and keep logging error messages until someone realizes what happened and reconfigures it. I believe we need a warning and then a critical alarm when we are getting close to the page cache limit. I'd also change the behavior to stop handling new metrics, so that we stay within the limit while remaining in a critical alarm state. No problem merging this if it won't fill the log with errors, but it definitely doesn't fix the problem, regardless of the specificity of the user request.
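To make the warning-then-critical idea concrete, here is a minimal sketch in Python. It is a toy model, not Netdata's health configuration or C code: the function name `page_cache_alarm_state` and the 80% and 95% thresholds are assumptions chosen only for illustration.

```python
def page_cache_alarm_state(used_pages: int, max_pages: int,
                           warn_ratio: float = 0.80,
                           crit_ratio: float = 0.95) -> str:
    """Classify page cache pressure (hypothetical thresholds).

    Returns "CLEAR", "WARNING", or "CRITICAL" depending on how close the
    number of used pages is to the configured limit.
    """
    if max_pages <= 0:
        raise ValueError("max_pages must be positive")
    ratio = used_pages / max_pages
    if ratio >= crit_ratio:
        return "CRITICAL"
    if ratio >= warn_ratio:
        return "WARNING"
    return "CLEAR"


# Example: a cache limited to 100 pages at increasing levels of use.
states = [page_cache_alarm_state(n, 100) for n in (50, 80, 95, 100)]
```

With these assumed thresholds, an operator would see a warning well before the cache is exhausted, and the critical state would persist for as long as the cache stays full.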
TBH, when I said logging I just meant some kind of visible warning. An alarm should be fine, especially if it also stops handling new metrics so the web UI remains responsive. The real issue is that when it happened there were no clues about why it was hanging that someone without the skills to use gdb could see.
This will not fill the log, but the deadlock will remain as it is currently.
…small and define relevant alarms
With the latest commit it should now drop metrics to resolve the deadlock condition and raise a corresponding alarm. Netdata remains responsive.
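The drop-instead-of-block behavior described above can be illustrated with a small Python toy model. This is only a sketch under stated assumptions: `ToyPageCache`, `try_insert`, `flush_one`, and the `dropped` counter are hypothetical names and do not correspond to the actual dbengine code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPageCache:
    """Toy stand-in for a bounded page cache (not Netdata's real structure)."""
    max_pages: int                      # hypothetical hard limit on cached pages
    pages: list = field(default_factory=list)
    dropped: int = 0                    # hypothetical counter an alarm could watch

    def try_insert(self, page) -> bool:
        """Store a page if there is room; otherwise drop it and return False.

        Returning immediately instead of waiting means the producer never
        blocks on a cache that cannot free any space.
        """
        if len(self.pages) >= self.max_pages:
            self.dropped += 1
            return False
        self.pages.append(page)
        return True

    def flush_one(self) -> None:
        """Simulate the oldest page being written to disk and evicted."""
        if self.pages:
            self.pages.pop(0)


# Example: a cache of 2 pages receives 4 pages, then one slot is freed.
cache = ToyPageCache(max_pages=2)
results = [cache.try_insert(f"page-{i}") for i in range(4)]
cache.flush_one()
results.append(cache.try_insert("page-4"))
```

In this model the third and fourth inserts are dropped and counted rather than waited on, and collection resumes as soon as a slot is freed. A counter like `dropped` is the kind of value an alarm could be defined on.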
Hi @mfundul,
I have started testing your PR, and it was writing a lot of messages to my error.log. Before showing the images that illustrate this, let me describe the tests that I did:
- I compiled Netdata using:
CFLAGS="-O1 -ggdb -Wall -Wextra -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./netdata-installer.sh --disable-go --disable-lto --dont-wait
- I changed the file
/usr/libexec/netdata/python.d/example.chart.py
to have 16000 dimensions instead of 4. I did this by changing the line to: for i in range(1, 16000):
- My netdata.conf had the following parameters:
memory mode = dbengine
page cache size = 32
dbengine disk space = 256
I ran Netdata for only 6 minutes with this configuration and got more than 200000 messages:
I've changed it to only print once per dbengine instance.
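Printing once per instance can be sketched as a per-instance flag that is checked before logging. The Python below is a toy illustration only: `ToyEngineInstance`, `report_deadlock`, and the message text are hypothetical and are not taken from the PR's C code.

```python
import logging

logger = logging.getLogger("toy_dbengine")


class ToyEngineInstance:
    """Toy stand-in for one dbengine instance (hypothetical names)."""

    def __init__(self, name: str):
        self.name = name
        self.deadlock_reported = False  # set after the first log message
        self.deadlock_events = 0        # every occurrence is still counted

    def report_deadlock(self) -> bool:
        """Count the event; log it only the first time for this instance.

        Returns True if a message was logged, False if it was suppressed.
        """
        self.deadlock_events += 1
        if self.deadlock_reported:
            return False
        self.deadlock_reported = True
        logger.error("%s: page cache is too small for the number of metrics",
                     self.name)
        return True


# Example: two instances each hit the condition 1000 times.
instances = [ToyEngineInstance("instance-a"), ToyEngineInstance("instance-b")]
logged = sum(inst.report_deadlock() for inst in instances for _ in range(1000))
```

Here 2000 events produce only two log lines, one per instance, while the per-instance counters still record how often the condition occurred.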
After running it for a few hours, I could verify that I received only one error message when the problem reported in the issue happened.
After adjusting the dbengine variables inside netdata.conf to use hundreds of variables, I could also verify that the error was not reported.
In this last group of tests, I used more or less the same configuration as in the previous tests. The only exception was that instead of 16K charts, this time I used 160.
* Detect deadlock in dbengine page cache when there are too many metrics and print error message
* Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms
* Changed printing deadlock errors to only happen once per dbengine instance
Summary
Fixes #6910
Component Name
database/engine
Additional Information
Detect deadlock in dbengine page cache when there are too many metrics and print error message.