
Question about Batches, Retry and unique jobs #3312

Closed
raivil opened this issue Jan 11, 2017 · 16 comments

Comments

@raivil

raivil commented Jan 11, 2017

Ruby version: 2.2.5
Sidekiq / Pro / Enterprise version(s): Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2

Hi,

I have a batch that runs jobs, and when one of these jobs fails it gets re-enqueued, but the same job ends up being (re)executed several times concurrently. The sub-jobs have unique_for: 24.hours, but this setting does not seem to be respected.

Is this expected behavior?

Thank you.

Below is a sample code for the batch job and sub jobs.

class BatchJob
  include Sidekiq::Worker

  sidekiq_options queue: :default, unique_for: 1.hour

  def perform(id)
    model = ::Model.find(id)
    batch = Sidekiq::Batch.new
    batch.description = self.class.name
    batch.on(:success, self.class, "model_id" => id)
    batch.jobs do
      model.sub_models.each do |sub_model|
        SubJob.perform_async(sub_model.id)
      end
    end
  end

  def on_success(_status, options)
    ::Model.find(options["model_id"])
    # mark as success.
  end
end
class SubJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, unique_for: 24.hours

  def perform(id)
    sub_model = SubModel.find(id)
    sub_model.processing_task
  end
end

The Sidekiq initializer has the following lines, among others.

Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs
@mperham
Collaborator

mperham commented Jan 11, 2017

  • Could you give your full initializer? Specifically are you using any special fetch scheme?
  • Are the duplicate jobs being created any time there is a job error or only upon deployment?
  • You have sample code here but to easily track down the issue I'd need a working app which reproduces the problem.

@raivil
Author

raivil commented Jan 11, 2017

  • I can't share the full initializer, but I'm using timed_fetch for reliability:
Sidekiq.configure_server do |config|
...
  config.timed_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
  config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
end
  • Duplicated jobs are being created any time there's an error, not only on deployment.
  • I'll create a new app to reproduce the issue and let you know.

@raivil
Author

raivil commented Jan 13, 2017

Hi Mike,

I made this app to try to reproduce the issue: https://github.com/raivil/sidekiq_issue_3312.
Please check the README for instructions to run the batch. Let me know if I missed anything.
I added a full initializer as well.

This issue is very hard to replicate and I wasn't able to do it consistently.
I got errors in two situations:
1 - On the initial execution, after enqueueing the batch, all 20 jobs from that batch failed, and on the next retries I had 26 jobs running when I should have had only 20.
2 - I left it running overnight and, after a few retries, I had 22 jobs in the retry queue, as shown in the screenshot below.

[Screenshot: retry queue showing 22 jobs, taken 2017-01-12 17:20]

Let me know if you have any questions or if I can do anything else to help mitigate this issue.
Thank you.

@mperham
Collaborator

mperham commented Jan 13, 2017

What is the purpose of the RETAINED array in the subjob? That's a memory leak and out of memory errors can cause all sorts of unpredictable behavior.

@mperham
Collaborator

mperham commented Jan 13, 2017

Thanks for the app. You still need to give me a step-by-step list of commands to reproduce the scenario. How do I create the models? How do I create the batch job?

@raivil
Author

raivil commented Jan 18, 2017

Mike,
Since this issue is very hard to replicate, using an array to inflate the job's memory usage was one way I found to simulate it.

I've added the necessary commands to run it to the README file.

I'm still investigating the issue. Thank you for helping.

@mperham
Collaborator

mperham commented Jan 20, 2017

I followed your directions. 30MB is too small for the memory killer; it kills the child Sidekiqs before they have a chance to actually process any jobs. Raising it to 60MB, I got the expected results: 60 failures with 20 retries (i.e. the 20 jobs had each retried three times already). Make sure you are using the reliable scheduler to ensure you don't lose jobs.

https://github.com/mperham/sidekiq/wiki/Reliability#scheduler

@raivil
Author

raivil commented Jan 30, 2017

Mike
Thanks for testing.
I'm indeed using the reliable scheduler, but still having problems in our production environment.
Reproducing it locally is very hard and intermittent.

What I observed is that this issue seems to happen mainly with long-running jobs.

I'll continue to work on it and try to reproduce the error in a consistent way.

@kroehre

kroehre commented Jan 31, 2017

@raivil, @mperham we are seeing this issue as well. We are on:

sidekiq (4.2.4)
sidekiq-ent (1.3.2)
sidekiq-pro (3.3.2)

We have a long-running job (10-20 minutes) that we think is getting killed and restarted in duplicate (2-3 jobs, 1 sharing the jid of the job that stopped). It is enqueued approximately every 15 minutes. The job has sidekiq_options retry: false, unique_for: 20.minutes, but we sometimes end up with 10+ running at once.

Our current theory is that sidekiqswarm is killing the jobs due to memory and retrying them in duplicate, but we can only observe this happening in production, so we have been unable to validate the theory.
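The timing here can also matter on its own: a unique_for lock is a TTL on the unique token, so once the 20-minute lock expires, a new push succeeds even if the original job is still running. Below is a toy sketch of that interaction (this is not Sidekiq's actual locking code; it just models a token-to-expiry table):

```ruby
# Toy model of a unique_for-style TTL lock: a push is rejected only while
# the token's lock is unexpired. A job that outlives its own lock TTL
# cannot be protected from duplicate enqueues.
locks = {} # unique_token => lock expiry (seconds)

push = lambda do |token, now, ttl|
  return false if locks[token] && locks[token] > now # lock still held: reject push
  locks[token] = now + ttl                           # acquire lock until now + ttl
  true
end

ttl = 20 * 60                     # unique_for: 20.minutes
push.call("job-a", 0, ttl)        # => true  (first enqueue takes the lock)
push.call("job-a", 15 * 60, ttl)  # => false (15 min in, lock still held)
push.call("job-a", 25 * 60, ttl)  # => true  (lock expired, original job may still be running)
```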

@mperham
Collaborator

mperham commented Jan 31, 2017

@kroehre Forgive me if I'm misunderstanding but if you want something to constantly be running, you should start your own Thread within Sidekiq or run your own process. Sidekiq is designed for lots of small jobs, not one job in a constant loop. The pain you are feeling is because you are pushing against that design.
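One way to do that (a minimal sketch using Sidekiq's standard :startup/:shutdown server lifecycle events; `PeriodicSync.sync!` is a hypothetical placeholder for the actual work) is to start the loop in a thread owned by the Sidekiq process instead of enqueueing a recurring job:

```ruby
# Sketch: run continuous work in a background thread started when the
# Sidekiq server boots and stopped when it shuts down.
# PeriodicSync.sync! is a hypothetical stand-in for the long-running work.
sync_thread = nil

Sidekiq.configure_server do |config|
  config.on(:startup) do
    sync_thread = Thread.new do
      loop do
        PeriodicSync.sync!
        sleep 60
      end
    end
  end

  config.on(:shutdown) do
    sync_thread.kill if sync_thread # stop the loop on process shutdown
  end
end
```

This sidesteps uniqueness entirely: there is exactly one loop per process, and it dies with the process instead of being re-enqueued.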

@kroehre

kroehre commented Jan 31, 2017

@mperham that's fair, and we are evaluating breaking up the job or finding a different solution. Just wanted to chime in on the issue since we appear to be observing the same or a very similar problem.

@mperham mperham closed this as completed Feb 7, 2017
@raivil
Author

raivil commented Feb 9, 2017

@mperham

Our jobs process both large and small files, and we've seen this issue happen with both kinds.
Having several repeated jobs with the same jid/class/parameters after a failure or restart suggests a possible issue with re-enqueueing jobs that should be unique for the period defined by unique_for.

Furthermore, today I was able to gather some more evidence when this issue occurred again.

This is what happened:

  • Batch with 5 jobs.
  • 3 jobs successfully executed.
  • New Deployment
  • Batch Failures go from 0 to 2. Reason: Sidekiq::Shutdown
  • Workers restart (2017-02-09T19:42:00Z)
  • 2 Pending jobs start again. Start time 2017-02-09T19:42:00Z.
  • 2 pending jobs running for 1 hour.
  • 2 new jobs enqueued and running, with the same JID/params/class as the executing jobs. No exceptions, no errors.
  • These 2 duplicated jobs were not enqueued by the app.
  • The new duplicated jobs were enqueued exactly 1 hour later; they started at 2017-02-09T20:42:00Z.
  • Log showing new jobs start:
[TID-1rdmuw JID-c0cdb0b6c7c6ab075bc20f96 MyJob] Start: 2017-02-09T20:42:00Z: [{"class"=>"MyJob", "args"=>[14590], "retry"=>true, "queue"=>"medium", "backtrace"=>true, "unique_for"=>86400, "jid"=>"c0cdb0b6c7c6ab075bc20f96", "created_at"=>1486665172.9725857, "bid"=>"mO_YyJtzld7vUQ", "unique_token"=>"ecca564f8dd70b31c3412e82131965ac342d3929", "unlocks_at"=>"1486751572.9726343", "enqueued_at"=>1486665172.9746706}]

[TID-1redsw JID-7bc6b7ca307d26ea21f030e0 MyJob] Start: 2017-02-09T20:42:00Z: [{"class"=>"MyJob", "args"=>[14591], "retry"=>true, "queue"=>"medium", "backtrace"=>true, "unique_for"=>86400, "jid"=>"7bc6b7ca307d26ea21f030e0", "created_at"=>1486665172.973353, "bid"=>"mO_YyJtzld7vUQ", "unique_token"=>"ccd84a56addb1837de693cd490d9b407c7e4b1fa", "unlocks_at"=>"1486751572.973372", "enqueued_at"=>1486665172.9747047}]
  • Workers still running. No restarts.

Update:

  • 2 duplicated jobs were enqueued again, totaling 6 jobs. Enqueue time: 2017-02-09T21:42:00Z
  • One of the jobs finished. Job count 5. Batch pending count updated to 1.
  • 1 new job was re-enqueued. Duplicate of the job pending in the batch. Start time: 2017-02-09T22:42:00Z

@mperham
Collaborator

mperham commented Feb 11, 2017

Looking at 7bc6b7ca307d26ea21f030e0, the times look consistent:

  • created_at 2017-02-09 18:32:52 UTC
  • enqueued_at 2017-02-09 18:32:52 UTC
  • restart 2017-02-09T20:42:00Z
  • unlocks_at 2017-02-10 18:32:52 UTC
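Those times can be checked directly against the epoch values in the log line, e.g.:

```ruby
require "time"

# Epoch values copied from the log line for JID 7bc6b7ca307d26ea21f030e0
created_at = 1486665172.973353
unlocks_at = 1486751572.973372

puts Time.at(created_at).utc.iso8601 # 2017-02-09T18:32:52Z
puts Time.at(unlocks_at).utc.iso8601 # 2017-02-10T18:32:52Z

hours_locked = ((unlocks_at - created_at) / 3600).round
puts hours_locked # 24, matching unique_for: 24.hours
```

So the unique lock was still held for the full 24 hours; whatever re-enqueued the job at 20:42 did not go through the unique-jobs check.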

Could you show me the deployment code? How do you quiet/restart Sidekiq?

@raivil
Author

raivil commented Feb 13, 2017

@mperham
Foreman is used to export the upstart scripts that handle starts/stops.
Upstart kills the main Sidekiq process with SIGTERM.

start on starting my_services-worker
stop on stopping my_services-worker
respawn

exec su - myuser --session-command 'cd /my-app/releases/20170200000000; export PORT=5000; RAILS_ENV=$RAILS_ENV MAXMEM_KB=$MAXMEM bundle exec sidekiqswarm -q critical -q default -q medium -q low >> /my-app/shared/log/worker-1.log 2>&1'

My biggest concern is why the jobs were re-enqueued at one-hour intervals (2017-02-09T20:42:00Z, 2017-02-09T21:42:00Z, 2017-02-09T22:42:00Z) while the original job was still executing.

@mperham
Collaborator

mperham commented Feb 13, 2017

timed_fetch will re-enqueue jobs which have not finished within one hour, as noted in the cons section of the wiki page:

https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server#timed_fetch

Switch to super_fetch.
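The one-hour re-enqueue intervals above match timed_fetch's visibility timeout exactly: any job still running after an hour is treated as lost and pushed back onto the queue, and that re-enqueue bypasses the unique lock. A sketch of the initializer change (super_fetch! is the Sidekiq Pro API being referred to; it requires a Pro version that ships it):

```ruby
Sidekiq.configure_server do |config|
  # config.timed_fetch!  # re-enqueues any in-progress job after its timeout (1 hour here)
  config.super_fetch!    # recovers jobs only when a process dies; no time-based re-enqueue
  config.reliable_scheduler!
end
```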

@raivil
Author

raivil commented Feb 16, 2017

Thanks Mike.
I'll check super_fetch.
