
Periodic job soaks CPU, dies with SoftTimeLimitExceeded #1131

Closed
MarErm27 opened this issue Oct 30, 2021 · 26 comments

Comments

@MarErm27

MarErm27 commented Oct 30, 2021

Symptom 1

Version

21.10.0

Steps to Reproduce

  1. Install 21.9.0
  2. Collect CPU metrics
  3. Upgrade to 21.10.0
  4. Notice an increased CPU usage

Expected Result

CPU usage remains the same

Actual Result

CPU usage has increased, which caused the AWS CPU credit balance to drop.
[screenshot: sentry-credits-usage]

Symptom 2

Version

21.11.0.dev0

Steps to Reproduce

Sentry worked for a week, but now it has started spamming everyone's email with internal issues. How do I fix them?

[screenshot: Issues]

Expected Result

No internal issues were expected

Actual Result

Internal issues spam everyone's email

@chadwhitacre
Member

Looks like you lost your connection to ClickHouse. Why is that? Is this still ongoing?

@MarErm27
Author

MarErm27 commented Nov 8, 2021

Hello, @chadwhitacre !
I noticed that some issues don't repeat while others keep recurring:
[screenshot]
The one that keeps recurring is called "SoftTimeLimitExceeded()":
[screenshot]

It seems to be related to Redis. Here are the Redis logs; I don't see any issues there:
[screenshot]

By the way, I'm new to Sentry. Where can I find information on what Sentry uses ClickHouse, Redis, and the other containers for?

@chadwhitacre
Member

information on what Sentry uses ClickHouse, Redis, and the other containers for?

Sounds like a +1 for #789. 😬

@chadwhitacre
Member

SoftTimeLimitExceeded

#1165 (comment) suggests this is due to underprovisioning.

@MarErm27
Author

SoftTimeLimitExceeded

#1165 (comment) suggests this is due to underprovisioning.

Hello, @chadwhitacre, thank you for your reply!

[screenshot]

@MarErm27
Author

MarErm27 commented Nov 23, 2021

@chadwhitacre
I just noticed that it stopped working 6 days ago. No new issues have been recorded in any project since then:
[screenshot]

@MarErm27
Author

MarErm27 commented Nov 23, 2021

@chadwhitacre I restarted everything and it works again now. How can I avoid this in the future? Sentry silently stopped storing errors.

@chadwhitacre
Member

How can I avoid this in the future?

Have you considered migrating to SaaS? :)

@chadwhitacre changed the title from "Internal issues" to "Seeing SoftTimeLimitExceeded repeatedly" Nov 23, 2021
@chadwhitacre
Member

Possibly related to #1141? 🤔

@chadwhitacre
Member

From @renchap on #1141:

I checked my internal project and we also have those SoftTimeLimitExceeded errors, and they seem to correlate with the CPU spikes.
It looks like a job is launched periodically, uses all the CPU and then gets terminated with SoftTimeLimitExceeded.

Folding that into this, then.

@chadwhitacre changed the title from "Seeing SoftTimeLimitExceeded repeatedly" to "Periodic job soaks CPU, dies with SoftTimeLimitExceeded" Nov 23, 2021
@chadwhitacre pinned this issue Nov 23, 2021
@relaxolotl

Thanks for the detailed report! I'm currently taking a look at what's happening and will follow up with some more info and possibly questions once I'm properly caught up on the details.

In the meantime, it looks like this is related to something we added recently: Sentry writes some basic measurements to a Redis store, which help the processing pipeline apply backpressure when certain projects are behaving poorly. The timeouts are related to that Redis store, which, based on the volume of errors you're seeing, is simply not responding at all.
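
To give a rough idea of the pattern, here is a generic redis-py sketch (not Sentry's actual implementation; the key names, TTL, and threshold are made up):

    import redis

    # Generic sketch of the pattern described above (not Sentry's actual code).
    # Per-project measurements go into Redis and are later consulted to decide
    # whether to apply backpressure. Key names, TTL, and threshold are invented.
    r = redis.Redis(host="redis", port=6379, socket_timeout=1)  # short timeout, as in the failing task

    def record_measurement(project_id: int) -> None:
        key = f"demo:lpq:{project_id}"
        r.incr(key)         # bump a small per-project counter
        r.expire(key, 600)  # keep the window short so the data set stays tiny

    def should_throttle(project_id: int, threshold: int = 1000) -> bool:
        value = r.get(f"demo:lpq:{project_id}")  # None if no measurements yet
        return int(value or 0) > threshold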

@relaxolotl

relaxolotl commented Nov 24, 2021

It looks like you've bumped into a bit of a bug that was indeed introduced in 21.10.0 and 21.11.0, my apologies. If you (or other users) are still seeing these errors after a restart, a quick fix would be to add

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

to your sentry/sentry.conf.py and restart. This should de-register the task that's producing all of these errors. If this isn't working, please let me know and we can try to figure something out. I'll work on a long-term fix in the meanwhile.
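
For reference, a minimal sketch of how that would sit in sentry/sentry.conf.py, assuming the stock self-hosted config (which star-imports the server defaults, so CELERYBEAT_SCHEDULE is already in scope):

    # sentry/sentry.conf.py (excerpt; a sketch, not the complete file)
    from sentry.conf.server import *  # noqa  (the stock self-hosted config already starts with this import)

    # Workaround from this thread: de-register the periodic task that keeps timing out.
    del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]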

As @chadwhitacre noted above, the observation in #1165 (comment) that this may be caused by underprovisioning is one possible root cause for this behaviour. The operation that's timing out is expected to operate on a very small data set, hence its extremely short timeout. Another possibility is that some setup related to the Redis store was missing and was repaired by the restart.

Regardless of the amount of resources allocated to sentry, the feature that's spitting out these errors needs to be disabled for on-premise self-hosted configurations given the scale they're expected to operate at. Simple, minimal setups have no need for the feature that's triggering these issues.

As mentioned above, I'll be following up with a fix on sentry itself which should disable this feature entirely for self-hosted users. I'll follow up on this issue once that's complete, including a tentative version number that contains the fix.

@renchap
Contributor

renchap commented Nov 28, 2021

@relaxolotl I applied your change and I no longer see the CPU spikes or the Python exceptions, but the CPU usage is consistently much higher now:

[screenshot]

Over the last month:
[screenshot]

The first increase is the update to 21.10.0, and the second one is the configuration change.

I am also seeing disk usage increase a lot since the update to 21.10.0; I don't know if this is related:
[screenshot]

I am also seeing a lot of those errors in the logs:

cron_1                                      | celery beat v4.4.7 (cliffs) is starting.
cron_1                                      | __    -    ... __   -        _
cron_1                                      | LocalTime -> 2021-11-28 20:50:58
cron_1                                      | Configuration ->
cron_1                                      |     . broker -> redis://redis:6379/0
cron_1                                      |     . loader -> celery.loaders.app.AppLoader
cron_1                                      |     . scheduler -> celery.beat.PersistentScheduler
cron_1                                      |     . db -> /tmp/sentry-celerybeat
cron_1                                      |     . logfile -> [stderr]@%WARNING
cron_1                                      |     . maxinterval -> 5.00 minutes (300s)
cron_1                                      | Traceback (most recent call last):
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/kombu/utils/objects.py", line 42, in __get__
cron_1                                      |     return obj.__dict__[self.__name__]
cron_1                                      | KeyError: 'scheduler'
cron_1                                      |
cron_1                                      | During handling of the above exception, another exception occurred:
cron_1                                      |
cron_1                                      | Traceback (most recent call last):
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/apps/beat.py", line 109, in start_scheduler
cron_1                                      |     service.start()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 622, in start
cron_1                                      |     humanize_seconds(self.scheduler.max_interval))
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/kombu/utils/objects.py", line 44, in __get__
cron_1                                      |     value = obj.__dict__[self.__name__] = self.__get(obj)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 666, in scheduler
cron_1                                      |     return self.get_scheduler()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 657, in get_scheduler
cron_1                                      |     return symbol_by_name(self.scheduler_cls, aliases=aliases)(
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 501, in __init__
cron_1                                      |     Scheduler.__init__(self, *args, **kwargs)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 257, in __init__
cron_1                                      |     self.setup_schedule()
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 544, in setup_schedule
cron_1                                      |     self.merge_inplace(self.app.conf.beat_schedule)
cron_1                                      |   File "/usr/local/lib/python3.8/site-packages/celery/beat.py", line 454, in merge_inplace
cron_1                                      |     entry = self.Entry(**dict(b[key], name=key, app=self.app))
cron_1                                      | TypeError: 'NoneType' object is not iterable
cron_1                                      | 20:50:58 [CRITICAL] celery.beat: beat raised exception <class 'TypeError'>: TypeError("'NoneType' object is not iterable")

There have been no major changes in usage or traffic since the update.

@flub

flub commented Nov 30, 2021

It looks like you've bumped into a bit of a bug that was indeed introduced in 21.10.0 and 21.11.0, my apologies. If you (or other users) are still seeing these errors after a restart, a quick fix would be to add

CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] = None

I wonder if this would have been better to do:

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

as a hotfix instead, and whether the original line is still the reason for your traceback. I'm not sure whether this would affect CPU usage or not.

@relaxolotl

Oh darn, that actually explains the errors in the logs. My apologies, @renchap; to reiterate what @flub said, could you replace

CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] = None

with

del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"]

?

I'll update my original comment to reflect this. The del should remove the errors related to TypeError: 'NoneType' object is not iterable.
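
For the curious, the difference between the two variants can be reproduced outside of Sentry. Per the traceback in the cron logs, celery beat's merge_inplace builds each entry with dict(b[key], name=key, app=self.app), so an entry that is None blows up, while a deleted key is simply never seen. A standalone illustration (plain Python, not Sentry code):

    # Why "= None" fails while "del" works, mirroring the merge_inplace call above.
    schedule = {"check-symbolicator-lpq-project-eligibility": None}

    try:
        dict(schedule["check-symbolicator-lpq-project-eligibility"], name="demo")
    except TypeError as exc:
        print(exc)  # 'NoneType' object is not iterable (the error from the cron logs)

    # Deleting the key removes the entry entirely, so beat never tries to build it.
    del schedule["check-symbolicator-lpq-project-eligibility"]
    print(schedule)  # {}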

@relaxolotl

As a general heads up: if no Celery/Celerybeat-related errors are being spit out after the hotfix is applied but CPU usage is still increased, it'll be a little more difficult to investigate the root cause due to the nature of self-hosted releases. This is because self-hosted releases batch all changes made to sentry on a monthly basis (e.g. 21.10.0 has all of the changes from mid-Sept to mid-Oct; see https://develop.sentry.dev/self-hosted/releases/).

In other words, I may need to pull in somebody else to narrow things down, because there's so much bundled into every release (see https://github.com/getsentry/sentry/releases/tag/21.10.0).

cc @chadwhitacre in case I've gotten any details wrong

@renchap
Contributor

renchap commented Nov 30, 2021

Thanks @relaxolotl and @flub, it works 👍

It is a bit too early to tell, but it looks like the CPU usage has also dropped back to its previous levels:

[screenshot]

(the last spike is a full restart to ensure all was correctly deployed)

I will have a look tomorrow with a few hours of data.

@chadwhitacre
Member

chadwhitacre commented Nov 30, 2021

Love the progress here! 👏

@relaxolotl We actually have quite fine-grained versioning for self-hosted in addition to the monthly CalVer releases. Cloning the self-hosted repo and running install.sh will get the very latest version. Ask away if you need more debugging info! :)

@renchap
Contributor

renchap commented Dec 2, 2021

So after a few days of data, I can confirm that:

  • the change in sentry.conf.py no longer causes a large increase in average CPU
  • but version 21.10.0 still seems to use more CPU than 21.9.0
  • and we still have periodic CPU spikes, probably caused by a periodic job, which trigger our monitoring as they use close to the full CPU available. This did not happen with 21.9.0

Last month's CPU usage:
[screenshot]
(Upgrade to 21.10.0 was on the 18th, the invalid change in sentry.conf.py on the 28th, and the correct change in sentry.conf.py on the 30th.)

Last 4 hours CPU usage:
[screenshot]

@renchap
Contributor

renchap commented Dec 8, 2021

Any news on this?

@chadwhitacre
Member

@relaxolotl I think we have some changes that have landed in the nightlies by now, yes? New monthly release is scheduled for the day after tomorrow.

@chadwhitacre mentioned this issue Dec 15, 2021
@relaxolotl

Yep, there are changes in the nightlies (and the just-tagged release) which should improve things out of the box. If the CPU spikes still show up despite the aforementioned task being disabled, and especially if they still happen as of the most recent monthly release, the root cause may be out of my purview; I'm afraid I'm not too familiar with most of the other changes to sentry that could be causing this. What do you think of bringing this up with the backend team(s) internally, @chadwhitacre?

@chadwhitacre
Member

chadwhitacre commented Dec 15, 2021

@renchap What version are you currently running? If you're not on a nightly (i.e., latest master) then let's have you test out 21.12.0 once that goes out tomorrow.

@relaxolotl If we hear back from @renchap that this isn't resolved, then I agree, let's put our heads together internally and see what we can come up with for next steps.

Thanks, gang!

@renchap
Contributor

renchap commented Dec 16, 2021

I just updated to 21.12.0; we will see how it goes.

As a side note, you need to remove the del CELERYBEAT_SCHEDULE["check-symbolicator-lpq-project-eligibility"] line from your sentry.conf.py before upgrading, otherwise the upgrade will fail.
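
If you would rather keep the workaround in place across upgrades, a guarded variant should be forward-compatible, since pop() with a default is a no-op once the task no longer exists (a sketch, same sentry.conf.py context as above):

    # sentry/sentry.conf.py (excerpt): forward-compatible variant of the workaround.
    # pop() with a default does nothing if the task has already been removed upstream,
    # so this line would not need to be deleted before upgrading.
    CELERYBEAT_SCHEDULE.pop("check-symbolicator-lpq-project-eligibility", None)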

@renchap
Contributor

renchap commented Dec 17, 2021

After 1 day, it seems that the CPU usage is back to what it was with 21.9.0 on my instance.

@MarErm27 can you update to 21.12.0 and see if you still see the issue on yours?

@chadwhitacre
Member

Good news, @renchap!

I'm going to go ahead and close this out on the strength of your testimony. If someone else wants to chime in and we need to reopen we can do so.

Thanks, everyone!

@chadwhitacre unpinned this issue Dec 20, 2021
@github-actions bot locked and limited conversation to collaborators Jan 5, 2022