Periodic job soaks CPU, dies with SoftTimeLimitExceeded #1131
Comments
Looks like you lost your connection to Clickhouse. Why is that? Is this still ongoing?
Hello, @chadwhitacre! It seems that it's related to Redis. Here are the Redis logs; I don't see any issues there. By the way, I'm new to Sentry; where can I find information about what Sentry uses ClickHouse, Redis, and the other containers for?
Sounds like a +1 for #789. 😬
#1165 (comment) suggests this is due to underprovisioning.
Hello, @chadwhitacre, thank you for your reply!
@chadwhitacre
@chadwhitacre I restarted everything and it works again now. How can we avoid this in the future? Sentry silently stopped storing errors.
Have you considered migrating to SaaS? :)
Possibly related to #1141? 🤔
Folding that into this, then.
Thanks for the detailed report! I'm currently taking a look at what's happening and will follow up with some more info and possibly questions once I'm properly caught up on the details. In the meanwhile, it looks like this is related to something that we've added in recently: sentry writes to a redis store with some basic measurements which help the processing pipeline in applying some backpressure when certain projects are behaving poorly. The timeouts are related to that redis store, which based on the volume of errors you're seeing is simply not responding at all.
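(Not from the original thread, but for anyone trying to confirm whether that redis store is responding: a quick latency probe like the sketch below can help. The host and port are assumptions; the stock self-hosted compose file may not publish Redis on the host, in which case this needs to run from inside the Docker network.)

```python
# Quick Redis latency probe -- host/port are assumptions about your setup.
import time

import redis  # pip install redis

client = redis.Redis(host="127.0.0.1", port=6379, socket_timeout=1)

for _ in range(5):
    start = time.monotonic()
    client.ping()  # raises redis.exceptions.TimeoutError if the store is unresponsive
    print(f"PING round-trip: {(time.monotonic() - start) * 1000:.1f} ms")
```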
It looks like you've bumped into a bit of a bug that was indeed introduced in 21.10.0 and 21.11.0, my apologies. If you (or other users) are still seeing these errors after a restart, a quick fix would be to add
to your

As @chadwhitacre has noted above, the observation in #1165 (comment) that this may be caused by underprovisioning is a possible root cause for this behaviour. The operation that's timing out is expected to operate on a very small data set, hence its extremely short timeout. Another possible issue is that some setup related to the redis store was missing, which was repaired by a restart. Regardless of the amount of resources allocated to sentry, the feature that's spitting out these errors needs to be disabled for

As mentioned above, I'll be following up on this with a fix on sentry itself which should disable this feature entirely for self-hosted users. I'll follow up on this issue once that's complete, which should also include a tentative version number that contains the fix.
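(For context on the `SoftTimeLimitExceeded` in the issue title: Celery enforces a soft time limit per task and raises that exception inside the task once the limit elapses, which is why a normally tiny operation blows up when its backing store stops responding. A minimal, generic sketch follows; the app name, broker URL, and task are hypothetical and are not Sentry's actual periodic job.)

```python
# Generic illustration of Celery soft time limits -- names here are made up.
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("example", broker="redis://localhost:6379/0")


def do_work():
    """Placeholder for the real work; expected to finish well within the soft limit."""


@app.task(soft_time_limit=10, time_limit=15)
def scan_small_dataset():
    try:
        do_work()
    except SoftTimeLimitExceeded:
        # If the backing store stops responding, even a "small" operation can run
        # past the soft limit, and Celery raises this exception inside the task.
        raise
```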
@relaxolotl I applied your change and I no longer see the CPU spikes nor the Python exceptions, but the CPU usage is much higher all the time now: the first increase is the update to 21.10.0, and the second one is the configuration change. I am also seeing the disk usage increasing a lot since the update to 21.10.0; I don't know if this is related. I am also seeing a lot of these errors in the logs:
There have been no major changes in usage or traffic since the update.
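(Also not part of the original exchange: if it helps to narrow down which container is eating the CPU, a rough sketch using the Docker SDK for Python is below, roughly equivalent to eyeballing `docker stats`. It assumes the `docker` package is installed and the process has access to the Docker socket.)

```python
# Rough per-container CPU% snapshot using the Docker SDK for Python ("docker" package).
import docker

client = docker.from_env()

for container in client.containers.list():
    stats = container.stats(stream=False)  # one-shot stats sample
    cpu = stats["cpu_stats"]
    pre = stats.get("precpu_stats", {})
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre.get("cpu_usage", {}).get("total_usage", 0)
    system_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)
    if system_delta > 0:
        percent = cpu_delta / system_delta * cpu.get("online_cpus", 1) * 100.0
        print(f"{container.name}: {percent:.1f}% CPU")
```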
I wonder if this would have been better to do:
as a hotfix instead, and whether that is still the reason for your traceback. Not sure if this would affect CPU or not.
Oh darn, that actually explains the errors in the logs. My apologies @renchap; to reiterate what @flub said, could you replace
with
? I'll update my original comment to reflect this. The
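(The exact snippets referenced above are not reproduced in this thread, so the following is only a generic, hypothetical sketch of how a Celery beat entry can be neutered from `sentry.conf.py`, which in the stock self-hosted setup starts with `from sentry.conf.server import *` and therefore has `CELERYBEAT_SCHEDULE` in scope. The key `"some-periodic-task"` is a placeholder, not the real entry being discussed.)

```python
# sentry.conf.py -- illustrative only; "some-periodic-task" is a placeholder key, not
# the actual schedule entry referenced (but not shown) in the comments above.
from datetime import timedelta

# Option A: push the schedule far enough out that the task effectively never runs.
if "some-periodic-task" in CELERYBEAT_SCHEDULE:  # provided by `from sentry.conf.server import *`
    CELERYBEAT_SCHEDULE["some-periodic-task"]["schedule"] = timedelta(days=365)

# Option B: drop the beat entry entirely so celerybeat never enqueues the task.
CELERYBEAT_SCHEDULE.pop("some-periodic-task", None)
```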
As a general heads up, if no errors related to Celery/Celerybeat are being spit out after the hotfix is applied but CPU usage is still increased, it'll be a little bit more difficult to investigate the root cause of the increased CPU usage due to the nature of self-hosted releases. This is because self-hosted releases batch up all changes made to sentry on a monthly basis (e.g. 21.10.0 has all of the changes from mid-Sept to mid-Oct; see https://develop.sentry.dev/self-hosted/releases/). In other words, I may need to pull in somebody else or other people to narrow things down, because there's so much bundled into every release (see https://github.com/getsentry/sentry/releases/tag/21.10.0). cc @chadwhitacre in case I've gotten any details wrong
Thanks @relaxolotl and @flub, it works 👍 It is a bit too early to tell, but it looks like the CPU usage also dropped to the previous levels (the last spike is a full restart to ensure all was correctly deployed). I will have a look tomorrow with a few hours of data.
Love the progress here! 👏 @relaxolotl We actually have quite fine-grained versioning for
Any news on this?
@relaxolotl I think we have some changes that have landed in the nightlies by now, yes? The new monthly release is scheduled for the day after tomorrow.
Yep, there are changes in the nightlies (and the just-tagged release) which should improve things out of the box. If the CPU spikes are still showing up despite the aforementioned task being disabled, and especially if they still happen as of the most recent monthly release, the root cause may be out of my purview; I'm afraid I'm not too familiar with most of the other changes in sentry that could be causing this. What do you think of possibly bringing this up with the backend team(s) internally, @chadwhitacre?
@renchap What version are you currently running? If you're not on a nightly (i.e., latest

@relaxolotl If we hear back from @renchap that this isn't resolved, then I agree, let's put our heads together internally and see what we can come up with for next steps. Thanks, gang!
I just updated to 21.12.0; we will see how it goes. As a side note, you need to remove the
After 1 day, it seems that the CPU usage is back to what it was with 21.9.0 on my instance. @MarErm27 can you update to 21.12.0 and see if you still see the issue on yours?
Good news, @renchap! I'm going to go ahead and close this out on the strength of your testimony. If someone else wants to chime in and we need to reopen, we can do so. Thanks, everyone!
Symptom 1
Version
21.10.0
Steps to Reproduce
Expected Result
CPU usage remains the same
Actual Result
CPU usage has increased, which caused the AWS CPU credit balance to drop.
Symptom 2
Version
21.11.0.dev0
Steps to Reproduce
Sentry worked for a week, but now it has started spamming everyone's email with internal issues. How can I fix them?
Expected Result
No internal issues were expected
Actual Result
Internal issues are spamming everyone's email