
Periodic job soaks CPU, dies with SoftTimeLimitExceeded #1131

Closed
MarErm27 opened this issue Oct 30, 2021 · 26 comments

Comments

@MarErm27

MarErm27 commented Oct 30, 2021

Symptom 1

Version

21.10.0

Steps to Reproduce

  1. Install 21.9.0
  2. Collect CPU metrics
  3. Upgrade to 21.10.0
  4. Notice an increased CPU usage

Expected Result

CPU usage remains the same

Actual Result

CPU usage has increased, which caused the AWS CPU credit balance to drop.
[screenshot: sentry-credits-usage]

Symptom 2

Version

21.11.0.dev0

Steps to Reproduce

Sentry worked for a week, but now it has started spamming everyone's email with internal issues. How do I fix them?

[screenshot: Issues]

Expected Result

No internal issues were expected

Actual Result

Internal issues spam everyone's email

@chadwhitacre
Member

Looks like you lost your connection to ClickHouse. Why is that? Is this still ongoing?

@MarErm27
Author

MarErm27 commented Nov 8, 2021

Hello, @chadwhitacre !
I noticed that some issues don't repeat while others keep recurring:
[screenshot]
The one that keeps recurring is called "SoftTimeLimitExceeded()":
[screenshot]

It seems to be related to Redis. Here are the Redis logs; I don't see any issues there:
[screenshot]

By the way, I'm new to Sentry. Where can I find information on what Sentry uses ClickHouse, Redis, and the other containers for?

@chadwhitacre
Member

information on what Sentry uses ClickHouse, Redis, and the other containers for?

Sounds like a +1 for #789. 😬

@chadwhitacre
Member

SoftTimeLimitExceeded

#1165 (comment) suggests this is due to underprovisioning.

@MarErm27
Author

SoftTimeLimitExceeded

#1165 (comment) suggests this is due to underprovisioning.

Hello, @chadwhitacre, thank you for your reply!

[screenshot]

@MarErm27
Author

MarErm27 commented Nov 23, 2021

@chadwhitacre
I just noticed that it stopped working 6 days ago. No new issues have been recorded in any project since then:
[screenshot]

@MarErm27
Author

MarErm27 commented Nov 23, 2021

@chadwhitacre I restarted everything and it works again now. How can I avoid this in the future? Sentry silently stopped storing errors.

@chadwhitacre
Member

How can I avoid this in the future?

Have you considered migrating to SaaS? :)

@chadwhitacre changed the title from "Internal issues" to "Seeing SoftTimeLimitExceeded repeatedly" Nov 23, 2021
@chadwhitacre
Member

Possibly related to #1141? 🤔

@chadwhitacre
Member

From @renchap on #1141:

I checked my internal project and we also have those SoftTimeLimitExceeded errors, and they seem to correlate with the CPU spikes.
It looks like a job is launched periodically, uses all the CPU and then gets terminated with SoftTimeLimitExceeded.

Folding that into this, then.

@chadwhitacre changed the title from "Seeing SoftTimeLimitExceeded repeatedly" to "Periodic job soaks CPU, dies with SoftTimeLimitExceeded" Nov 23, 2021
@chadwhitacre pinned this issue Nov 23, 2021
@relaxolotl

Thanks for the detailed report! I'm currently taking a look at what's happening and will follow up with some more info and possibly questions once I'm properly caught up on the details.

In the meantime, it looks like this is related to something we added recently: Sentry writes some basic measurements to a Redis store, which help the processing pipeline apply backpressure when certain projects are behaving poorly. The timeouts are related to that Redis store, which, based on the volume of errors you're seeing, is simply not responding at all.
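
To give a rough idea of the pattern, here is a generic redis-py sketch (not Sentry's actual implementation; the key names, TTL, and threshold are made up):

    import redis

    # Generic sketch of the pattern described above (not Sentry's actual code).
    # Per-project measurements go into Redis and are later consulted to decide
    # whether to apply backpressure. Key names, TTL, and threshold are invented.
    r = redis.Redis(host="redis", port=6379, socket_timeout=1)  # short timeout, as in the failing task

    def record_measurement(project_id: int) -> None:
        key = f"demo:lpq:{project_id}"
        r.incr(key)         # bump a small per-project counter
        r.expire(key, 600)  # keep the window short so the data set stays tiny

    def should_throttle(project_id: int, threshold: int = 1000) -> bool:
        value = r.get(f"demo:lpq:{project_id}")  # None if no measurements yet
        return int(value or 0) > threshold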

@relaxolotl

relaxolotl commented Nov 24, 2021

It looks like you've bumped into a bit of a bug that was indeed introduced in 21.10.0 and 21.11.0, my apologies. If you (or other users) are still seeing these errors after a restart, a quick fix would be to add

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

to your sentry/sentry.conf.py and restart. This should de-register the task that's producing all of these errors. If this isn't working, please let me know and we can try to figure something out. I'll work on a long-term fix in the meanwhile.
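
For reference, a minimal sketch of how that would sit in sentry/sentry.conf.py, assuming the stock self-hosted config (which star-imports the server defaults, so CELERYBEAT_SCHEDULE is already in scope):

    # sentry/sentry.conf.py (excerpt; a sketch, not the complete file)
    from sentry.conf.server import *  # noqa  (the stock self-hosted config already starts with this import)

    # Workaround from this thread: de-register the periodic task that keeps timing out.
    del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]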

As @chadwhitacre noted above, the observation in #1165 (comment) that this may be caused by underprovisioning is one possible root cause for this behaviour. The operation that's timing out is expected to operate on a very small data set, hence its extremely short timeout. Another possibility is that some setup related to the Redis store was missing and was repaired by the restart.

Regardless of the amount of resources allocated to sentry, the feature that's spitting out these errors needs to be disabled for on-premise self-hosted configurations given the scale they're expected to operate at. Simple, minimal setups have no need for the feature that's triggering these issues.

As mentioned above, I'll be following up with a fix on sentry itself which should disable this feature entirely for self-hosted users. I'll follow up on this issue once that's complete, including a tentative version number that contains the fix.

@renchap
Contributor

renchap commented Nov 28, 2021

@relaxolotl I applied your change and I no longer see the CPU spikes or the Python exceptions, but the CPU usage is consistently much higher now:

[screenshot]

Over the last month:
[screenshot]

The first increase is the update to 21.10.0, and the second one is the configuration change.

I am also seeing disk usage increase a lot since the update to 21.10.0; I don't know if this is related:
[screenshot]

I am also seeing a lot of those errors in the logs:

cron_1                                      | celery beat v4.4.7 (cliffs) is starting.
cron_1                                      | __    -    ... __   -        _
cron_1                                      | LocalTime -> 2021-11-28 20:50:58
cron_1                                      | Configuration ->
cron_1                                      |     . broker -> redis://redis:6379/0
cron_1                                      |     . loader -> celery.loaders.app.AppLoader
cron_1                                      |     . scheduler -> celery.beat.PersistentScheduler
cron_1                                      |     . db -> /tmp/sentry-celerybeat
cron_1                                      |     . logfile -> [stderr]@%WARNING
cron_1                                      |     . maxinterval -> 5.00 minutes (300s)
cron_1                                      | Traceback (most recent call last):
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/kombu/utils/objects.py", line 42, in __get__
cron_1                                      |     return obj.__dict__[self.__name__]
cron_1                                      | KeyError: 'scheduler'
cron_1                                      |
cron_1                                      | During handling of the above exception, another exception occurred:
cron_1                                      |
cron_1                                      | Traceback (most recent call last):
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/apps/beat.py", line 109, in start_scheduler
cron_1                                      |     service.start()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 622, in start
cron_1                                      |     humanize_seconds(self.scheduler.max_interval))
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/kombu/utils/objects.py", line 44, in __get__
cron_1                                      |     value = obj.__dict__[self.__name__] = self.__get(obj)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 666, in scheduler
cron_1                                      |     return self.get_scheduler()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 657, in get_scheduler
cron_1                                      |     return symbol_by_name(self.scheduler_cls, aliases=aliases)(
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 501, in __init__
cron_1                                      |     Scheduler.__init__(self, *args, **kwargs)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 257, in __init__
cron_1                                      |     self.setup_schedule()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 544, in setup_schedule
cron_1                                      |     self.merge_inplace(self.app.conf.beat_schedule)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 454, in merge_inplace
cron_1                                      |     entry = self.Entry(**dict(b[key], name=key, app=self.app))
cron_1                                      | TypeError: 'NoneType' object is not iterable
cron_1                                      | 20:50:58 [CRITICAL] celery.beat: beat raised exception <class 'TypeError'>: TypeError("'NoneType' object is not iterable")

There have been no major changes in usage or traffic since the update.

@flub

flub commented Nov 30, 2021

It looks like you've bumped into a bit of a bug that was indeed introduced in 21.10.0 and 21.11.0, my apologies. If you (or other users) are still seeing these errors after a restart, a quick fix would be to add

CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] = None

I wonder if this would have been better to do:

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

as a hotfix instead, and whether the original line is still the reason for your traceback. I'm not sure whether this would affect CPU usage or not.

@relaxolotl

Oh darn, that actually explains the errors in the logs. My apologies, @renchap; to reiterate what @flub said, could you replace

CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] = None

with

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

?

I'll update my original comment to reflect this. The del should remove the errors related to TypeError: 'NoneType' object is not iterable.
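
For the curious, the difference between the two variants can be reproduced outside of Sentry. Per the traceback in the cron logs, celery beat's merge_inplace builds each entry with dict(b[key], name=key, app=self.app), so an entry that is None blows up, while a deleted key is simply never seen. A standalone illustration (plain Python, not Sentry code):

    # Why "= None" fails while "del" works, mirroring the merge_inplace call above.
    schedule = {"check-symbolicator-lpq-project-eligibility": None}

    try:
        dict(schedule["check-symbolicator-lpq-project-eligibility"], name="demo")
    except TypeError as exc:
        print(exc)  # 'NoneType' object is not iterable (the error from the cron logs)

    # Deleting the key removes the entry entirely, so beat never tries to build it.
    del schedule["check-symbolicator-lpq-project-eligibility"]
    print(schedule)  # {}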

@relaxolotl

As a general heads up: if no Celery/Celerybeat-related errors are being spit out after the hotfix is applied but CPU usage is still increased, it'll be a little more difficult to investigate the root cause due to the nature of self-hosted releases. This is because self-hosted releases batch all changes made to sentry on a monthly basis (e.g. 21.10.0 has all of the changes from mid-Sept to mid-Oct; see https://develop.sentry.dev/self-hosted/releases/).

In other words, I may need to pull in somebody else to narrow things down, because there's so much bundled into every release (see https://github.com/getsentry/sentry/releases/tag/21.10.0).

cc @chadwhitacre in case I've gotten any details wrong

@renchap
Contributor

renchap commented Nov 30, 2021

Thanks @relaxolotl and @flub, it works 👍

It is a bit too early to tell, but it looks like the CPU usage has also dropped back to its previous levels:

[screenshot]

(the last spike is a full restart to ensure all was correctly deployed)

I will have a look tomorrow with a few hours of data.

@chadwhitacre
Member

chadwhitacre commented Nov 30, 2021

Love the progress here! 👏

@relaxolotl We actually have quite fine-grained versioning for self-hosted in addition to the monthly CalVer releases. Cloning the self-hosted repo and running install.sh will get the very latest version. Ask away if you need more debugging info! :)

@renchap
Contributor

renchap commented Dec 2, 2021

So after a few days of data, I can confirm that:

  • the change in sentry.conf.py no longer causes a large increase in average CPU
  • but version 21.10.0 still seems to use more CPU than 21.9.0
  • and we still have periodic CPU spikes, probably caused by a periodic job, which trigger our monitoring as they use close to the full CPU available. This did not happen with 21.9.0

Last month's CPU usage:
[screenshot]
(Upgrade to 21.10.0 was on the 18th, the invalid change in sentry.conf.py on the 28th, and the correct change in sentry.conf.py on the 30th.)

Last 4 hours CPU usage:
[screenshot]

@renchap
Contributor

renchap commented Dec 8, 2021

Any news on this?

@chadwhitacre
Member

@relaxolotl I think we have some changes that have landed in the nightlies by now, yes? New monthly release is scheduled for the day after tomorrow.

@chadwhitacre mentioned this issue Dec 15, 2021
@relaxolotl

Yep, there are changes in the nightlies (and the just-tagged release) which should improve things out of the box. If the CPU spikes still show up despite the aforementioned task being disabled, and especially if they still happen as of the most recent monthly release, the root cause may be out of my purview; I'm afraid I'm not too familiar with most of the other changes to sentry that could be causing this. What do you think of bringing this up with the backend team(s) internally, @chadwhitacre?

@chadwhitacre
Member

chadwhitacre commented Dec 15, 2021

@renchap What version are you currently running? If you're not on a nightly (i.e., latest master) then let's have you test out 21.12.0 once that goes out tomorrow.

@relaxolotl If we hear back from @renchap that this isn't resolved, then I agree, let's put our heads together internally and see what we can come up with for next steps.

Thanks, gang!

@renchap
Contributor

renchap commented Dec 16, 2021

I just updated to 21.12.0; we will see how it goes.

As a side note, you need to remove the del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] line from your sentry.conf.py before upgrading, otherwise the upgrade will fail.
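
If you would rather keep the workaround in place across upgrades, a guarded variant should be forward-compatible, since pop() with a default is a no-op once the task no longer exists (a sketch, same sentry.conf.py context as above):

    # sentry/sentry.conf.py (excerpt): forward-compatible variant of the workaround.
    # pop() with a default does nothing if the task has already been removed upstream,
    # so this line would not need to be deleted before upgrading.
    CELERYBEAT_SCHEDULE.pop("check-symbolicator-lpq-project-eligibility", None)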

@renchap
Contributor

renchap commented Dec 17, 2021

After 1 day, it seems that the CPU usage is back to what it was with 21.9.0 on my instance.

@MarErm27 can you update to 21.12.0 and see if you still see the issue on yours?

@chadwhitacre
Member

Good news, @renchap!

I'm going to go ahead and close this out on the strength of your testimony. If someone else wants to chime in and we need to reopen we can do so.

Thanks, everyone!

@chadwhitacre unpinned this issue Dec 20, 2021
@github-actions bot locked and limited conversation to collaborators Jan 5, 2022