Question about Batches, Retry and unique jobs #3312
Hi Mike, I made this app to try to replicate the issue: https://github.com/raivil/sidekiq_issue_3312. The issue is very hard to replicate and I wasn't able to do it consistently. Let me know if you have any questions or if I can do anything else to help mitigate this issue.
What is the purpose of the RETAINED array in the subjob? That's a memory leak, and out-of-memory errors can cause all sorts of unpredictable behavior.
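For readers unfamiliar with the pattern being asked about, a hypothetical illustration (not taken from the linked app) of why a constant-level array is a leak:

```ruby
require 'sidekiq'

# Anything appended to a top-level constant is never garbage collected,
# so process memory grows on every execution until the memory killer fires.
RETAINED = []

class SubJob
  include Sidekiq::Worker

  def perform(id)
    RETAINED << ("x" * 1_024 * 1_024) # ~1MB retained per execution
  end
end
```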
Thanks for the app. You still need to give me a step-by-step list of commands to reproduce the scenario. How do I create the models? How do I create the batch job?
Mike, I've added the necessary commands to the README file. I'm still investigating the issue. Thank you for helping.
I followed your directions. 30MB is too small for the memory killer; it kills the child Sidekiqs before they have a chance to actually process any jobs. Raising it to 60MB, I got the expected results: 60 failures with 20 retries (i.e. the 20 jobs had each retried three times already). Make sure you are using the reliable scheduler so you don't lose jobs: https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
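For reference, enabling the reliable scheduler is a one-line server setting; a minimal sketch, assuming Sidekiq Pro 3.x:

```ruby
# config/initializers/sidekiq.rb -- sketch only
Sidekiq.configure_server do |config|
  # Reliable scheduler: scheduled/retry jobs are moved to their queues
  # atomically in Redis, so a crash during the poll cannot drop them.
  config.reliable_scheduler!
end
```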
Mike, what I observed here is that this issue seems to happen mainly with long-running jobs. I'll continue to work on it and try to reproduce the error in a consistent way.
@raivil, @mperham we are seeing this issue as well. We are on sidekiq (4.2.4). We have a long-running job (10-20 minutes) that we think is getting killed and restarted in duplicate (2-3 jobs, one sharing the jid of the job that stopped). It is enqueued approximately every 15 minutes. The job has
Our current theory is that sidekiqswarm is killing the jobs due to memory and retrying them in duplicate, but we are only able to observe this happening in production, so we have been unable to validate the theory.
@kroehre Forgive me if I'm misunderstanding, but if you want something to constantly be running, you should start your own Thread within Sidekiq or run your own process. Sidekiq is designed for lots of small jobs, not one job in a constant loop. The pain you are feeling is because you are pushing against that design.
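One way to follow that suggestion is a lifecycle hook that spawns the long-lived loop alongside the job processor; a sketch, where the work method is hypothetical and stands in for the perpetual task:

```ruby
# config/initializers/sidekiq.rb -- sketch of running your own thread
# instead of a perpetual job.
Sidekiq.configure_server do |config|
  config.on(:startup) do
    Thread.new do
      loop do
        perform_constant_work # hypothetical; replace with your own loop body
        sleep 60
      end
    end
  end
end
```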
@mperham that's fair, and we are evaluating breaking up the job or finding a different solution. Just wanted to chime in on the issue since we appear to be observing the same or a very similar problem.
Our jobs process both large and small files, and we've seen this issue with both kinds. Furthermore, today I was able to gather some more evidence when the issue occurred again. This is what happened:
Update:
Looking at 7bc6b7ca307d26ea21f030e0, the times look consistent:
Could you show me the deployment code? How do you quiet/restart Sidekiq?
@mperham
My biggest concern is why the jobs were re-enqueued at one-hour intervals.
timed_fetch will re-enqueue jobs which have not finished within one hour, as noted in the cons section of the wiki page: https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server#timed_fetch Switch to super_fetch.
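Switching is likewise a one-line server setting; a sketch, assuming a Sidekiq Pro version where super_fetch is available:

```ruby
Sidekiq.configure_server do |config|
  # super_fetch tracks in-progress jobs per process in Redis and only
  # re-enqueues them once the owning process is confirmed gone, rather than
  # after a fixed one-hour timeout like timed_fetch.
  config.super_fetch!
end
```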
Thanks Mike.
Ruby version: 2.2.5
Sidekiq / Pro / Enterprise version(s): Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2
Hi,
I have a batch that runs jobs, and when one of these jobs fails, it gets re-enqueued but ends up with the same job being (re)executed several times, and at the same time. Sub-jobs have
unique_for: 24.hours
but it seems that this config is not being respected. Is this expected behavior?
Thank you.
Below is sample code for the batch job and sub-jobs.
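The original snippet is not preserved in this thread; a minimal sketch of the shape being described, with hypothetical class names, assuming Sidekiq Pro batches and Enterprise unique jobs:

```ruby
# Hypothetical reconstruction -- class names and arguments are illustrative only.
class BatchJob
  include Sidekiq::Worker

  def perform
    batch = Sidekiq::Batch.new
    batch.on(:success, BatchJob)
    batch.jobs do
      100.times { |i| SubJob.perform_async(i) }
    end
  end

  # Invoked once every sub-job in the batch has succeeded.
  def on_success(status, options)
  end
end

class SubJob
  include Sidekiq::Worker
  # Enterprise unique jobs: duplicate pushes are suppressed for 24 hours.
  sidekiq_options unique_for: 24.hours

  def perform(i)
    # long-running work here
  end
end
```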
The Sidekiq initializer has the following lines, among others.
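The actual initializer lines were not captured here; the one setting that unique_for depends on is enabling Enterprise unique jobs server-side, so the relevant lines likely resemble this sketch:

```ruby
# config/initializers/sidekiq.rb -- sketch only
Sidekiq::Enterprise.unique! unless Rails.env.test?

Sidekiq.configure_server do |config|
  config.redis = { url: ENV['REDIS_URL'] } # assumed connection setup
end
```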