Detect deadlock in dbengine page cache #6911

mfundul · 2019-09-23T06:25:53Z

Summary

Fixes #6910

Component Name

database/engine

Additional Information

Detect deadlock in dbengine page cache when there are too many metrics and print error message.

squash-labs · 2019-09-23T06:25:57Z

Manage this branch in Squash

Test this branch here: https://mfunduldetect-and-log-deadlock-5z37v.squash.io

cakrit · 2019-09-23T07:52:43Z

I know that this is what the user asked for, but it looks like with this we'll leave netdata hanging and log who knows how many error messages until someone realizes what happened and reconfigures it?

I believe we need a warning and then a critical alarm when we are getting close to the page cache limit. I'd also change the behavior to stop handling new metrics, so that we remain within the limit, always remaining in a critical alarm state.

No problem merging this if it won't fill the log with errors, but it definitely doesn't fix the problem, regardless of the specificity of the user request.
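As a rough illustration of the warning-then-critical idea suggested above, here is a minimal Python sketch. The thresholds (80% and 95%) and all names are hypothetical, chosen only for the example; they are not taken from Netdata's health configuration or from this PR.

```python
from enum import Enum


class AlarmState(Enum):
    CLEAR = "clear"
    WARNING = "warning"
    CRITICAL = "critical"


# Hypothetical thresholds, in percent of the page cache limit.
WARNING_PERCENT = 80
CRITICAL_PERCENT = 95


def page_cache_alarm(used_pages: int, max_pages: int) -> AlarmState:
    """Classify page cache usage into an alarm state."""
    if max_pages <= 0:
        raise ValueError("max_pages must be positive")
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    if used_pages * 100 >= max_pages * CRITICAL_PERCENT:
        return AlarmState.CRITICAL
    if used_pages * 100 >= max_pages * WARNING_PERCENT:
        return AlarmState.WARNING
    return AlarmState.CLEAR


def accept_new_metric(used_pages: int, max_pages: int) -> bool:
    """Stop accepting new metrics once usage reaches the critical band."""
    return page_cache_alarm(used_pages, max_pages) is not AlarmState.CRITICAL
```

With these assumed thresholds, a cache at 50 of 100 pages is clear, 80 of 100 raises a warning, and 95 of 100 is critical and refuses new metrics.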

KrisShannon · 2019-09-23T08:05:56Z

TBH, when I said logging I just meant some kind of visible warning.

An alarm should be fine, especially if it also stops handling new metrics so the web UI remains responsive.

The issue is really that when it happened there were no clues about why it was hanging that someone without the skills to use gdb could see.

mfundul · 2019-09-23T08:29:46Z

This will not fill the log but the deadlock will remain as it is currently.

mfundul · 2019-09-23T10:21:18Z

With the latest commit it should now drop metrics so as to resolve the deadlock condition and raise a corresponding alarm. Netdata remains responsive.
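The behaviour described above (drop data instead of waiting forever when no page can be freed) can be sketched with a toy model. This is not the dbengine implementation: `ToyPageCache`, `store`, `flush` and the `dropped` counter are hypothetical names, and the model assumes each metric pins one page while it is being collected.

```python
class ToyPageCache:
    """Toy fixed-size page cache; an illustration, not Netdata's code."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages
        self.pinned = set()   # metrics currently holding a page
        self.evictable = 0    # flushed pages that may be reclaimed
        self.dropped = 0      # counter that an alarm could watch

    def store(self, metric: str) -> bool:
        """Try to get a page for `metric`; return False if data is dropped."""
        if metric in self.pinned:
            return True  # the metric already holds a page
        if len(self.pinned) + self.evictable >= self.max_pages:
            if self.evictable == 0:
                # Every page is pinned by a collector, so waiting for an
                # eviction would never finish. Drop instead of blocking.
                self.dropped += 1
                return False
            self.evictable -= 1  # reclaim one flushed page
        self.pinned.add(metric)
        return True

    def flush(self, metric: str) -> None:
        """Mark the metric's page as written out and therefore evictable."""
        if metric in self.pinned:
            self.pinned.remove(metric)
            self.evictable += 1
```

In this model a two-page cache accepts metrics `a` and `b`, drops data for `c` while both pages are pinned, and accepts `c` again once `a` has been flushed.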

thiagoftsm

Hi @mfundul,

I started testing your PR, and it was writing a lot of messages to my error.log. Before showing the images that illustrate this, let me describe the tests I ran:

* I compiled Netdata using: `CFLAGS="-O1 -ggdb -Wall -Wextra -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./netdata-installer.sh --disable-go --disable-lto --dont-wait`
* I changed the file `/usr/libexec/netdata/python.d/example.chart.py` to have 16000 dimensions instead of 4 by changing the line to: `for i in range(1, 16000):`
* My netdata.conf had the following parameters:

memory mode = dbengine
page cache size = 32
dbengine disk space = 256

I ran Netdata for only 6 minutes with this configuration and got more than 200000 messages.

mfundul · 2019-09-23T13:04:10Z

I've changed it to only print once per dbengine instance.
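A minimal sketch of the "print once per instance" pattern mentioned above, assuming a per-instance flag guards the log call. The class and attribute names are hypothetical and the message text is invented for the example; the actual change lives in the PR's C code.

```python
import logging

logger = logging.getLogger("dbengine_sketch")


class EngineInstance:
    """Hypothetical stand-in for one dbengine instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.deadlock_reported = False  # set after the first error is logged
        self.dropped = 0                # still counted on every occurrence

    def on_deadlock(self) -> bool:
        """Record a dropped data point; log only the first time.

        Returns True if an error message was emitted by this call.
        """
        self.dropped += 1
        if self.deadlock_reported:
            return False
        self.deadlock_reported = True
        logger.error(
            "page cache deadlock detected in instance %s, dropping metric data",
            self.name,
        )
        return True
```

Because the flag is stored on the instance rather than globally, a second instance would still report its own first deadlock, while repeated occurrences on the same instance only increment the counter.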

thiagoftsm

After running for a few hours, I could verify that I received only one error message when the problem reported in the issue happened.
After adjusting the dbengine variables inside netdata.conf to use hundreds of variables, I could also verify that the error was not reported.
In this last group of tests, I used more or less the same configuration as in the previous tests; the only exception was that instead of 16K charts, this time I used 160.
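A back-of-the-envelope estimate suggests why 16000 dimensions exhausted a `page cache size = 32` configuration while 160 did not. The sketch below rests on assumptions that are not stated in this thread: that each collected dimension needs roughly two 4096-byte pages, and that `page cache size` is expressed in MiB. Treat the numbers as an illustration only.

```python
PAGE_BYTES = 4096          # assumed page size
PAGES_PER_DIMENSION = 2    # assumed rule of thumb, not from this thread
MIB = 1024 * 1024


def min_page_cache_mib(dimensions: int) -> float:
    """Estimated minimum page cache, in MiB, for a number of dimensions."""
    return dimensions * PAGE_BYTES * PAGES_PER_DIMENSION / MIB


def fits(dimensions: int, page_cache_mib: int) -> bool:
    """True if the estimate fits within the configured page cache size."""
    return min_page_cache_mib(dimensions) <= page_cache_mib


print(min_page_cache_mib(16000))  # 125.0
print(min_page_cache_mib(160))    # 1.25
print(fits(16000, 32), fits(160, 32))  # False True
```

Under these assumptions the 16000-dimension test would need about 125 MiB, well above the configured 32, whereas 160 dimensions need only about 1.25 MiB.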

* Detect deadlock in dbengine page cache when there are too many metrics and print error message
* Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms
* Changed printing deadlock errors to only happen once per dbengine instance

Detect deadlock in dbengine page cache when there are too many metrics and print error message

6c8f759

mfundul added the area/database label Sep 23, 2019

mfundul added this to the v1.18-Sprint2 milestone Sep 23, 2019

mfundul requested review from cakrit, cosmix and thiagoftsm as code owners September 23, 2019 06:25

mfundul mentioned this pull request Sep 23, 2019

DBengine hanging without logging #6910

Closed

Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms

20daeb3

mfundul requested review from ktsaou and vlvkobal as code owners September 23, 2019 10:17

thiagoftsm requested changes Sep 23, 2019

Changed printing deadlock errors to only happen once per dbengine instance

b4f08b4

netdatabot added area/daemon area/health labels Sep 23, 2019

thiagoftsm self-requested a review September 23, 2019 17:52

thiagoftsm approved these changes Sep 23, 2019

cakrit approved these changes Sep 24, 2019

mfundul merged commit 2728be8 into netdata:master Sep 24, 2019

mfundul deleted the detect-and-log-deadlock-dbengine branch September 24, 2019 08:59
