Detect deadlock in dbengine page cache #6911

Merged
mfundul merged 3 commits into netdata:master from detect-and-log-deadlock-dbengine
Sep 24, 2019

mfundul
Contributor

@mfundul mfundul commented Sep 23, 2019

Summary

Fixes #6910

Component Name

database/engine

Additional Information

Detect a deadlock in the dbengine page cache when there are too many metrics, and print an error message.

@squash-labs

squash-labs bot commented Sep 23, 2019

Manage this branch in Squash

Test this branch here: https://mfunduldetect-and-log-deadlock-5z37v.squash.io

@cakrit
Contributor

cakrit commented Sep 23, 2019

I know that this is what the user asked for, but it looks like with this change we'll leave Netdata hanging and log who knows how many error messages until someone realizes what happened and reconfigures it?

I believe we need a warning and then a critical alarm when we are getting close to the page cache limit. I'd also change the behavior to stop handling new metrics, so that we remain within the limit, always remaining in a critical alarm state.

No problem merging this if it won't fill the log with errors, but it definitely doesn't fix the problem, regardless of the specificity of the user request.

@KrisShannon

KrisShannon commented Sep 23, 2019

TBH, when I said logging I just meant some kind of visible warning.

An alarm should be fine, especially if it also stops handling new metrics so the web UI remains responsive.

The real issue is that when it happened there were no clues about why it was hanging that someone without the skills to use gdb could see.

@mfundul
Contributor Author

mfundul commented Sep 23, 2019

This will not fill the log, but the deadlock will remain as it currently is.

@mfundul
Contributor Author

mfundul commented Sep 23, 2019

With the latest commit it should now drop metrics so as to resolve the deadlock condition and raise a corresponding alarm. Netdata remains responsive.
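
The behavior described in this comment, dropping metrics rather than waiting for a page that can never be freed, can be modeled roughly as follows. This is a minimal Python sketch, not the actual C code in database/engine. The class name `PageCache`, its fields, and the `dropped_points` counter are hypothetical, and the sketch assumes that every metric under collection keeps one page pinned.

```python
from dataclasses import dataclass, field


@dataclass
class PageCache:
    """Toy model of a fixed-size page cache (all names are hypothetical)."""
    max_pages: int
    pinned: set = field(default_factory=set)   # metrics that currently hold a page
    dropped_points: int = 0                    # a real system could alarm on this

    def store(self, metric: str) -> bool:
        """Try to store one data point; return False if it had to be dropped."""
        if metric in self.pinned:
            return True                        # metric already owns a page
        if len(self.pinned) < self.max_pages:
            self.pinned.add(metric)            # a free page is available
            return True
        # Every page is pinned by another writer, so waiting here would
        # never end. Drop the point and count it instead of blocking.
        self.dropped_points += 1
        return False


cache = PageCache(max_pages=2)
results = [cache.store(m) for m in ("cpu", "ram", "disk", "cpu")]
```

In this model the third metric is dropped because both pages are already pinned, while the metrics that hold a page keep working. Counting the drops instead of blocking is what keeps the rest of the process responsive and gives an alarm something to trigger on.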

Contributor

@thiagoftsm thiagoftsm left a comment


Hi @mfundul,

I started testing your PR, and it wrote a lot of messages to my error.log. Before showing the screenshot, let me describe the tests I ran:

  • I compiled Netdata using: CFLAGS="-O1 -ggdb -Wall -Wextra -Wformat-signedness -fstack-protector-all -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1" ./netdata-installer.sh --disable-go --disable-lto --dont-wait
  • I changed the file /usr/libexec/netdata/python.d/example.chart.py to have 16000 dimensions instead of 4 by changing the loop line to: for i in range(1, 16000):
  • My netdata.conf had the following parameters:
memory mode = dbengine
page cache size = 32
dbengine disk space = 256

I ran Netdata for only 6 minutes with this configuration and got more than 200000 messages:
[image: err33]

@mfundul
Contributor Author

mfundul commented Sep 23, 2019

I've changed it to only print once per dbengine instance.
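
A "log once per instance" guard of the kind mentioned here can be sketched as below. This is an illustrative Python model, not the code from this PR: the class `EngineInstance`, its attributes, and the message text are all hypothetical.

```python
import logging

logger = logging.getLogger("dbengine-sketch")


class EngineInstance:
    """Hypothetical stand-in for one dbengine instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.deadlock_reported = False   # per-instance flag
        self.deadlock_events = 0         # every occurrence is still counted

    def report_deadlock(self) -> bool:
        """Count the event; log only the first time. Returns True if it logged."""
        self.deadlock_events += 1
        if self.deadlock_reported:
            return False
        self.deadlock_reported = True
        # Hypothetical message text, for illustration only.
        logger.error("%s: page cache is too small for the number of metrics", self.name)
        return True


engine = EngineInstance("default")
logged = [engine.report_deadlock() for _ in range(200_000)]
```

Keeping the flag on the instance rather than in a global means a second instance would still get its own single message, while the event counter preserves how often the condition occurred.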

Contributor

@thiagoftsm thiagoftsm left a comment


After running it for a few hours, I verified that I received only one error message when the problem reported in the issue occurred.
After adjusting the dbengine variables in netdata.conf to handle hundreds of variables, I could also verify that the error was not reported.
In this last group of tests I used more or less the same configuration as in the previous tests. The only exception was that instead of 16K charts, this time I used 160.
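
A back-of-the-envelope estimate suggests why 16K triggered the condition and 160 did not. The sketch below assumes 4096-byte pages, two pages per collected dimension, and that the page cache size is given in MiB. None of these figures is stated in this thread, so treat them, along with the function name, as assumptions for illustration.

```python
PAGE_SIZE_BYTES = 4096        # assumed page size
PAGES_PER_DIMENSION = 2       # assumed minimum pages per collected dimension
MIB = 1024 * 1024


def min_page_cache_mib(dimensions: int) -> float:
    """Estimate the smallest page cache (in MiB) for a number of dimensions."""
    return dimensions * PAGE_SIZE_BYTES * PAGES_PER_DIMENSION / MIB


configured_mib = 32           # "page cache size = 32" from the test setup
for dims in (16000, 160):
    needed = min_page_cache_mib(dims)
    verdict = "fits" if needed <= configured_mib else "too small"
    print(f"{dims} dimensions need about {needed:.2f} MiB: {verdict}")
```

Under these assumptions 16000 dimensions need about 125 MiB and 160 need about 1.25 MiB, against a configured 32. Even with a single page per dimension, 16000 dimensions would need 62.5 MiB, so the conclusion does not hinge on the factor of two.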

@mfundul mfundul merged commit 2728be8 into netdata:master Sep 24, 2019
@mfundul mfundul deleted the detect-and-log-deadlock-dbengine branch September 24, 2019 08:59
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request Oct 9, 2019
* Detect deadlock in dbengine page cache when there are too many metrics and print error message

* Resolve dbengine deadlock by dropping metrics when page cache is too small and define relevant alarms

* Changed printing deadlock errors to only happen once per dbengine instance
jackyhuang85 pushed a commit to jackyhuang85/netdata that referenced this pull request Jan 1, 2020
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
Successfully merging this pull request may close these issues.

DBengine hanging without logging