
Question about Batches, Retry and unique jobs #3312

Closed
raivil opened this issue Jan 11, 2017 · 16 comments

Comments

@raivil

raivil commented Jan 11, 2017

Ruby version: 2.2.5
Sidekiq / Pro / Enterprise version(s): Sidekiq 4.1.3 / Pro 3.3.2 / Ent 1.3.2

Hi,

I have a batch that runs jobs, and when one of these jobs fails it gets re-enqueued, but the same job ends up being (re)executed several times concurrently. The sub-jobs have unique_for: 24.hours, but this setting does not seem to be respected.

Is this expected behavior?

Thank you.

Below is a sample code for the batch job and sub jobs.

class BatchJob
  include Sidekiq::Worker

  sidekiq_options queue: :default, unique_for: 1.hour

  def perform(id)
    model = ::Model.find(id)
    batch = Sidekiq::Batch.new
    batch.description = self.class.name
    batch.on(:success, self.class, "model_id" => id)
    batch.jobs do
      model.sub_models.each do |sub_model|
        SubJob.perform_async(sub_model.id)
      end
    end
  end

  def on_success(_status, options)
    ::Model.find(options["model_id"])
    # mark as success.
  end
end
class SubJob
  include Sidekiq::Worker
  sidekiq_options queue: :default, unique_for: 24.hours

  def perform(id)
    sub_model = SubModel.find(id)
    sub_model.processing_task
  end
end

The Sidekiq initializer has the following lines, among others.

Sidekiq.default_worker_options = { "backtrace" => true }
Sidekiq::Client.reliable_push! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Client
Sidekiq::Enterprise.unique! unless Rails.env.test? # https://github.com/mperham/sidekiq/wiki/Ent-Unique-Jobs
@mperham
Collaborator

mperham commented Jan 11, 2017

  • Could you give your full initializer? Specifically are you using any special fetch scheme?
  • Are the duplicate jobs being created any time there is a job error or only upon deployment?
  • You have sample code here but to easily track down the issue I'd need a working app which reproduces the problem.

@raivil
Author

raivil commented Jan 11, 2017

  • I can't share the full initializer, but I'm using timed_fetch for reliability:
Sidekiq.configure_server do |config|
...
  config.timed_fetch! # https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server
  config.reliable_scheduler! # https://github.com/mperham/sidekiq/wiki/Reliability#scheduler
end
  • Duplicated jobs are being created any time there's an error, not only on deployment.
  • I'll create a new app to reproduce the issue and let you know.

@raivil
Author

raivil commented Jan 13, 2017

Hi Mike,

I made this app to try to reproduce the issue: https://github.com/raivil/sidekiq_issue_3312.
Please check the README for instructions to run the batch. Let me know if I missed anything.
I added a full initializer as well.

This issue is very hard to replicate and I wasn't able to do it consistently.
I got errors in two situations:
1 - On the initial execution, after enqueueing the batch, all 20 jobs from that batch failed, and on the next retries I had 26 jobs running when I should have had only 20.
2 - I left it running overnight and, after a few retries, I had 22 jobs in the retry queue, as shown in the screenshot below.

[Screenshot: retry queue showing 22 jobs, taken 2017-01-12 17:20]

Let me know if you have any questions or if I can do anything else to help mitigate this issue.
Thank you.

@mperham
Collaborator

mperham commented Jan 13, 2017

What is the purpose of the RETAINED array in the subjob? That's a memory leak and out of memory errors can cause all sorts of unpredictable behavior.

@mperham
Collaborator

mperham commented Jan 13, 2017

Thanks for the app. You still need to give me a step-by-step list of commands to reproduce the scenario. How do I create the models? How do I create the batch job?

@raivil
Author

raivil commented Jan 18, 2017

Mike,
Since this issue is very hard to replicate, using an array to inflate the job's memory usage was one way I found to simulate it.

I've added the necessary commands to run it to the README file.

I'm still investigating the issue. Thank you for helping.

@mperham
Collaborator

mperham commented Jan 20, 2017

I followed your directions. 30MB is too small for the memory killer; it kills the child Sidekiqs before they have a chance to actually process any jobs. Raising it to 60MB, I got the expected results: 60 failures with 20 retries (i.e. the 20 jobs had each retried three times already). Make sure you are using the reliable scheduler to ensure you don't lose jobs.

https://github.com/mperham/sidekiq/wiki/Reliability#scheduler

@raivil
Author

raivil commented Jan 30, 2017

Mike
Thanks for testing.
I'm indeed using the reliable scheduler, but still having problems in our production environment.
Reproducing it locally is very hard and intermittent.

What I observed is that this issue seems to happen mainly with long-running jobs.

I'll continue to work on it and try to reproduce the error in a consistent way.

@kroehre

kroehre commented Jan 31, 2017

@raivil, @mperham we are seeing this issue as well. We are on:

sidekiq (4.2.4)
sidekiq-ent (1.3.2)
sidekiq-pro (3.3.2)

We have a long-running job (10-20 minutes) that we think is getting killed and restarted in duplicate (2-3 jobs, 1 sharing the jid of the job that stopped). It is enqueued approximately every 15 minutes. The job has sidekiq_options retry: false, unique_for: 20.minutes, but we sometimes end up with 10+ running at once.

Our current theory is that sidekiqswarm is killing the jobs due to memory and retrying them in duplicate, but we can only observe this happening in production, so we have been unable to validate the theory.
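The timing here can also matter on its own: a unique_for lock is a TTL on the unique token, so once the 20-minute lock expires, a new push succeeds even if the original job is still running. Below is a toy sketch of that interaction (this is not Sidekiq's actual locking code; it just models a token-to-expiry table):

```ruby
# Toy model of a unique_for-style TTL lock: a push is rejected only while
# the token's lock is unexpired. A job that outlives its own lock TTL
# cannot be protected from duplicate enqueues.
locks = {} # unique_token => lock expiry (seconds)

push = lambda do |token, now, ttl|
  return false if locks[token] && locks[token] > now # lock still held: reject push
  locks[token] = now + ttl                           # acquire lock until now + ttl
  true
end

ttl = 20 * 60                     # unique_for: 20.minutes
push.call("job-a", 0, ttl)        # => true  (first enqueue takes the lock)
push.call("job-a", 15 * 60, ttl)  # => false (15 min in, lock still held)
push.call("job-a", 25 * 60, ttl)  # => true  (lock expired, original job may still be running)
```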

@mperham
Collaborator

mperham commented Jan 31, 2017

@kroehre Forgive me if I'm misunderstanding but if you want something to constantly be running, you should start your own Thread within Sidekiq or run your own process. Sidekiq is designed for lots of small jobs, not one job in a constant loop. The pain you are feeling is because you are pushing against that design.
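One way to do that (a minimal sketch using Sidekiq's standard :startup/:shutdown server lifecycle events; `PeriodicSync.sync!` is a hypothetical placeholder for the actual work) is to start the loop in a thread owned by the Sidekiq process instead of enqueueing a recurring job:

```ruby
# Sketch: run continuous work in a background thread started when the
# Sidekiq server boots and stopped when it shuts down.
# PeriodicSync.sync! is a hypothetical stand-in for the long-running work.
sync_thread = nil

Sidekiq.configure_server do |config|
  config.on(:startup) do
    sync_thread = Thread.new do
      loop do
        PeriodicSync.sync!
        sleep 60
      end
    end
  end

  config.on(:shutdown) do
    sync_thread.kill if sync_thread # stop the loop on process shutdown
  end
end
```

This sidesteps uniqueness entirely: there is exactly one loop per process, and it dies with the process instead of being re-enqueued.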

@kroehre

kroehre commented Jan 31, 2017

@mperham that's fair, and we are evaluating breaking up the job or finding a different solution. Just wanted to chime in on the issue since we appear to be observing the same or a very similar problem.

@mperham mperham closed this as completed Feb 7, 2017
@raivil
Author

raivil commented Feb 9, 2017

@mperham

Our jobs process both large and small files, and we've seen this issue happen with both kinds.
Having several repeated jobs with the same jid/class/parameters after a failure or restart suggests a possible issue with re-enqueueing jobs that should be unique for the period defined by unique_for.

Furthermore, today I was able to gather some more evidence when this issue occurred again.

This is what happened:

  • Batch with 5 jobs.
  • 3 jobs successfully executed.
  • New Deployment
  • Batch Failures go from 0 to 2. Reason: Sidekiq::Shutdown
  • Workers restart (2017-02-09T19:42:00Z)
  • 2 Pending jobs start again. Start time 2017-02-09T19:42:00Z.
  • 2 pending jobs running for 1 hour.
  • 2 new jobs enqueued and running, with the same JID/params/class as the executing jobs. No exceptions, no errors.
  • These 2 duplicated jobs were not enqueued by the app.
  • The new duplicated jobs were enqueued exactly 1 hour later; they started at 2017-02-09T20:42:00Z.
  • Log showing new jobs start:
[TID-1rdmuw JID-c0cdb0b6c7c6ab075bc20f96 MyJob] Start: 2017-02-09T20:42:00Z: [{"class"=>"MyJob", "args"=>[14590], "retry"=>true, "queue"=>"medium", "backtrace"=>true, "unique_for"=>86400, "jid"=>"c0cdb0b6c7c6ab075bc20f96", "created_at"=>1486665172.9725857, "bid"=>"mO_YyJtzld7vUQ", "unique_token"=>"ecca564f8dd70b31c3412e82131965ac342d3929", "unlocks_at"=>"1486751572.9726343", "enqueued_at"=>1486665172.9746706}]

[TID-1redsw JID-7bc6b7ca307d26ea21f030e0 MyJob] Start: 2017-02-09T20:42:00Z: [{"class"=>"MyJob", "args"=>[14591], "retry"=>true, "queue"=>"medium", "backtrace"=>true, "unique_for"=>86400, "jid"=>"7bc6b7ca307d26ea21f030e0", "created_at"=>1486665172.973353, "bid"=>"mO_YyJtzld7vUQ", "unique_token"=>"ccd84a56addb1837de693cd490d9b407c7e4b1fa", "unlocks_at"=>"1486751572.973372", "enqueued_at"=>1486665172.9747047}]
  • Workers still running. No restarts.

Update:

  • 2 duplicated jobs were enqueued again, totaling 6 jobs. Enqueue time: 2017-02-09T21:42:00Z
  • One of the jobs finished. Job count 5. Batch pending count updated to 1.
  • 1 new job was re-enqueued. Duplicate of the job pending in the batch. Start time: 2017-02-09T22:42:00Z

@mperham
Collaborator

mperham commented Feb 11, 2017

Looking at 7bc6b7ca307d26ea21f030e0, the times look consistent:

  • created_at 2017-02-09 18:32:52 UTC
  • enqueued_at 2017-02-09 18:32:52 UTC
  • restart 2017-02-09T20:42:00Z
  • unlocks_at 2017-02-10 18:32:52 UTC
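Those times can be checked directly against the epoch values in the log line, e.g.:

```ruby
require "time"

# Epoch values copied from the log line for JID 7bc6b7ca307d26ea21f030e0
created_at = 1486665172.973353
unlocks_at = 1486751572.973372

puts Time.at(created_at).utc.iso8601 # 2017-02-09T18:32:52Z
puts Time.at(unlocks_at).utc.iso8601 # 2017-02-10T18:32:52Z

hours_locked = ((unlocks_at - created_at) / 3600).round
puts hours_locked # 24, matching unique_for: 24.hours
```

So the unique lock was still held for the full 24 hours; whatever re-enqueued the job at 20:42 did not go through the unique-jobs check.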

Could you show me the deployment code? How do you quiet/restart Sidekiq?

@raivil
Author

raivil commented Feb 13, 2017

@mperham
Foreman is used to export the upstart scripts that handle starts/stops.
Upstart kills the main Sidekiq process with SIGTERM.

start on starting my_services-worker
stop on stopping my_services-worker
respawn

exec su - myuser --session-command 'cd /my-app/releases/20170200000000; export PORT=5000; RAILS_ENV=$RAILS_ENV MAXMEM_KB=$MAXMEM bundle exec sidekiqswarm -q critical -q default -q medium -q low >> /my-app/shared/log/worker-1.log 2>&1'

My biggest concern is why the jobs were re-enqueued at one-hour intervals (2017-02-09T20:42:00Z, 2017-02-09T21:42:00Z, 2017-02-09T22:42:00Z) while the original job was still executing.

@mperham
Collaborator

mperham commented Feb 13, 2017

timed_fetch will re-enqueue jobs which have not finished within one hour, as noted in the cons section of the wiki page:

https://github.com/mperham/sidekiq/wiki/Pro-Reliability-Server#timed_fetch

Switch to super_fetch.
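The one-hour re-enqueue intervals above match timed_fetch's visibility timeout exactly: any job still running after an hour is treated as lost and pushed back onto the queue, and that re-enqueue bypasses the unique lock. A sketch of the initializer change (super_fetch! is the Sidekiq Pro API being referred to; it requires a Pro version that ships it):

```ruby
Sidekiq.configure_server do |config|
  # config.timed_fetch!  # re-enqueues any in-progress job after its timeout (1 hour here)
  config.super_fetch!    # recovers jobs only when a process dies; no time-based re-enqueue
  config.reliable_scheduler!
end
```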

@raivil
Author

raivil commented Feb 16, 2017

Thanks Mike.
I'll check super_fetch.
