
CPU overload with v3.0.4 #348

Closed

xymox opened this issue Mar 28, 2017 · 14 comments

Comments

xymox commented Mar 28, 2017

Hi,

It seems that the modifications introduced in v3.0.4 (in lib/shoryuken/manager.rb, I suppose) cause CPU overload of over 95% on some platforms (tested on Ubuntu and Mac OS X).

Reverting to v3.0.3 fixes the problem. I haven't tested the master branch, but I can run some tests and write steps to reproduce if needed.

Thanks for your attention,

Philippe

phstc (Collaborator) commented Mar 28, 2017

Hi @xymox

Hmm, this may cause a kind of constant loop when you don't have ready workers or your queues are empty.

What's your setup? Concurrency? Do you usually have empty queues or long-running jobs? I would like to try to reproduce it.

xymox (Author) commented Mar 28, 2017

Hi @phstc,

It seems that my problem comes from the delay parameter. If I set delay to 0 in the configuration, everything is OK, but if I use a bigger value, the problem appears again.

I'm testing Shoryuken in a blank Rails app, with only a default Active Job worker. My configuration looks like:

delay: 2
queues:
  - [shoryuken_debug, 1]

I'm using this configuration with a newly created SQS queue.
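
For context, the worker is nothing more than a bare-bones Active Job class pointed at that queue (plus config.active_job.queue_adapter = :shoryuken in the app config), something along these lines (the class name is illustrative):

class DebugJob < ApplicationJob
  queue_as :shoryuken_debug

  # No real work; just log the payload so I can see messages arriving.
  def perform(message)
    Rails.logger.info "received: #{message.inspect}"
  end
end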

I can put my app on a public repo if you need it.

Philippe

phstc (Collaborator) commented Mar 28, 2017

So delay: 0 works fine, delay: 2 overloads the CPU?

I'm using this configuration with a newly created SQS queue.

So your queue is empty, right?

xymox (Author) commented Mar 28, 2017

Exactly, delay > 0 overloads the CPU and I'm using an empty queue.

phstc (Collaborator) commented Mar 31, 2017

Hi @xymox

I couldn't reproduce it. Do I need to keep the app running for a long while? Do you have any wait time configured in the queue?

@nishio-dens

Hi @phstc @xymox

I encountered the same problem. Here are my environments:

  • Machine 1 (Local dev machine)
    • Shoryuken 3.0.4
    • ruby 2.3.1p112
    • Mac OSX 10.12.1
  • Machine 2 (Production)
    • Shoryuken 3.0.4
    • ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
    • AWS EC2 ( Linux version 3.10.0-514.10.2.el7.x86_64 )

And my configuration is as follows:

concurrency: 5
delay: 5
queues:
  - ['sys-s3-notification', 2]
  - ['sso-sys', 2]
  - ['sys-test-capture-overview', 2]
  - ['sys-test-capture-current', 2]
  - ['sys-ad-breaks', 2]

Reverting to shoryuken 3.0.3 fixes the problem.

I checked the shoryuken master branch source code, and after patching it as below, the CPU utilization decreased greatly.
(But I don't think this is a good solution.)

diff --git a/lib/shoryuken/manager.rb b/lib/shoryuken/manager.rb
index 978b49b..f081001 100644
--- a/lib/shoryuken/manager.rb
+++ b/lib/shoryuken/manager.rb
@@ -53,7 +53,10 @@ module Shoryuken

     private

+
+    DISPATCH_INTERVAL = 0.1
     def dispatch_async
+      sleep DISPATCH_INTERVAL
       @dispatcher_executor.post(&method(:dispatch_now))
     end

jjoos commented Apr 10, 2017

Experiencing the same issue.

This seems to be the PR causing it: #345

Will try the delay zero workaround and otherwise revert to the previous version.

jjoos commented Apr 10, 2017

I just looked at the code to see why this causes issues with delay != 0, since it sounded weird and interesting.

This is my current theory:
In the normal situation with delay = 0, the main loop is throttled by having to wait in the dispatch_now method, since it calls SQS to retrieve messages (in either dispatch_batch or dispatch_single_message). Because delay = 0 never pauses any queue, it always does this. On my machine that takes about a quarter of a second and blocks the thread, keeping the load down.

After #345, when a queue has a delay and no messages, it's paused most of the time, so the return unless (queue = @polling_strategy.next_queue) guard in dispatch_now is triggered and the slow fetch is never executed. This speeds up the loop so much (and removes the wait time) that it uses 100% CPU, except for the brief moments when the queue is unpaused.

I didn't explore in depth why the old code didn't have this problem, but it's probably because the heartbeat code slowed the main loop down with its 0.1-second execution interval?
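
To sanity-check the theory, here is a tiny self-contained simulation of the re-posting loop (nothing Shoryuken-specific, all names made up); it pegs a core in the same way while nothing is ready to fetch:

require 'concurrent'

executor = Concurrent::SingleThreadExecutor.new
queue_ready = false # stands in for a queue that delay > 0 keeps paused

dispatch = lambda do
  # When a queue is ready, the SQS receive round trip (~0.25 s here)
  # naturally throttles the loop; when everything is paused we skip it.
  sleep 0.25 if queue_ready

  # Re-post immediately, like dispatch_async does via the executor.
  executor.post(&dispatch)
end

executor.post(&dispatch)
sleep 10 # watch top: one core stays close to 100% while queue_ready is false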

phstc pushed a commit that referenced this issue Apr 10, 2017
Fix #348

When delay is specified, dispatch gets into an endless loop (the higher the delay, the worse)
phstc (Collaborator) commented Apr 10, 2017

Hi all

Thanks a lot for the feedback, it helped a lot.

Interestingly enough, if I don't enable logging, my CPU stays around 60% (still high), but if I enable it, it easily goes to 90%. Are you guys using -v?

@nishio-dens @jjoos I combined your feedback and added a delay only when there will be no fetching from SQS (which, as you pointed out, is in a way a delay in itself): https://github.com/phstc/shoryuken/pull/354/files#diff-9fdc6bf30b3f5b4078e1e4d4720765b7R66

WDYT?
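
The gist of the change, as a rough sketch rather than the actual diff (the constant name and surrounding structure are simplified):

def dispatch_now
  if (queue = @polling_strategy.next_queue).nil?
    # Nothing to fetch (all queues paused or empty): pause briefly before
    # trying again instead of spinning through the loop.
    sleep MIN_DISPATCH_INTERVAL
    return
  end

  # A queue is ready: fetch and dispatch right away, with no extra sleep,
  # so busy queues are not slowed down.
  dispatch_batch(queue)
ensure
  dispatch_async
end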

phstc added a commit that referenced this issue Apr 10, 2017
…-avoid-cpu-overload-348

Pause before dispatching to avoid CPU overload

Fix #348
jjoos commented Apr 11, 2017

Interestingly enough, if I don't enable logging, my CPU stays around 60% (still high), but if I enable it, it easily goes to 90%. Are you guys using -v?

Is that for a single core or the whole processor? In my case it takes a complete core without -v.

Solution is fine, thanks for the quick fix!

phstc (Collaborator) commented Apr 11, 2017

Hi @jjoos

I got 60%-90% in a single core.

If you run the latest in production, please let me know how it goes! 🍻

jjoos commented Apr 12, 2017

We just updated to the latest version and changed the delay back from 0 to 1; see the line marking the release in the graph below.
[screenshot, 2017-04-12 4:30 PM: CPU load graph before and after the release]

Everything seems to be working well!

phstc (Collaborator) commented Apr 12, 2017

@jjoos cool, I see. The spikes were when there were no messages in the queues, right?

jjoos commented Apr 13, 2017

Well, besides updating, I also changed the delay setting back from 0 to 1. So not seeing major changes in load is fine!

The spikes are actual jobs being processed on that machine. I'm not sure why the spikes were higher before the release than after; that's probably unrelated...
