
CPU overload with v3.0.4 #348

Closed

xymox opened this issue Mar 28, 2017 · 14 comments

Comments

xymox commented Mar 28, 2017

Hi,

It seems that the modifications introduced in v3.0.4 (in lib/shoryuken/manager.rb, I suppose) cause CPU overload of over 95% on some platforms (tested on Ubuntu and Mac OS X).

Reverting to v3.0.3 fixes the problem. I haven't tested the master branch, but I can run some tests and write steps to reproduce if needed.

Thanks for your attention,

Philippe

phstc (Collaborator) commented Mar 28, 2017

Hi @xymox

Hmm, this may cause a kind of constant loop when you don't have ready workers or your queues are empty.

What's your setup? Concurrency? Do you usually have empty queues or long-running jobs? I would like to try to reproduce it.

xymox (Author) commented Mar 28, 2017

Hi @phstc,

It seems that my problem comes from the delay parameter. If I set delay to 0 in the configuration, everything is OK, but if I use a bigger value, the problem appears again.

I'm testing Shoryuken in a blank Rails app, with only a default Active Job worker. My configuration looks like:

delay: 2
queues:
  - [shoryuken_debug, 1]

I'm using this configuration with a newly created SQS queue.
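
For context, the worker is nothing more than a bare-bones Active Job class pointed at that queue (plus config.active_job.queue_adapter = :shoryuken in the app config), something along these lines (the class name is illustrative):

class DebugJob < ApplicationJob
  queue_as :shoryuken_debug

  # No real work; just log the payload so I can see messages arriving.
  def perform(message)
    Rails.logger.info "received: #{message.inspect}"
  end
end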

I can put my app on a public repo if you need it.

Philippe

phstc (Collaborator) commented Mar 28, 2017

So delay: 0 works fine, delay: 2 overloads the CPU?

I'm using this configuration with a newly created SQS queue.

So your queue is empty, right?

xymox (Author) commented Mar 28, 2017

Exactly, delay > 0 overloads the CPU and I'm using an empty queue.

phstc (Collaborator) commented Mar 31, 2017

Hi @xymox

I couldn't reproduce it. Do I need to keep the app running for a long while? Do you have any wait time configured in the queue?

@nishio-dens

Hi @phstc @xymox

I encountered the same problem. Here are my environments:

  • Machine 1 (Local dev machine)
    • Shoryuken 3.0.4
    • ruby 2.3.1p112
    • Mac OSX 10.12.1
  • Machine 2 (Production)
    • Shoryuken 3.0.4
    • ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
    • AWS EC2 ( Linux version 3.10.0-514.10.2.el7.x86_64 )

And my configuration is as follows:

concurrency: 5
delay: 5
queues:
  - ['sys-s3-notification', 2]
  - ['sso-sys', 2]
  - ['sys-test-capture-overview', 2]
  - ['sys-test-capture-current', 2]
  - ['sys-ad-breaks', 2]

Reverting to shoryuken 3.0.3 fixes the problem.

I checked the shoryuken master branch source code, and after patching it as below, the CPU utilization decreased greatly.
(But I don't think this is a good solution.)

diff --git a/lib/shoryuken/manager.rb b/lib/shoryuken/manager.rb
index 978b49b..f081001 100644
--- a/lib/shoryuken/manager.rb
+++ b/lib/shoryuken/manager.rb
@@ -53,7 +53,10 @@ module Shoryuken

     private

+
+    DISPATCH_INTERVAL = 0.1
     def dispatch_async
+      sleep DISPATCH_INTERVAL
       @dispatcher_executor.post(&method(:dispatch_now))
     end

jjoos commented Apr 10, 2017

Experiencing the same issue.

This seems to be the PR causing it: #345

Will try the delay zero workaround and otherwise revert to the previous version.

jjoos commented Apr 10, 2017

I just looked at the code to see why this causes issues with delay != 0, since it sounded weird and interesting.

This is my current theory:
In the normal situation with delay = 0, the main loop is throttled by having to wait in the dispatch_now method, since it calls SQS to retrieve messages (in either dispatch_batch or dispatch_single_message). Because delay = 0 never pauses any queue, it always does this. On my machine that takes about a quarter of a second and blocks the thread, keeping the load down.

After #345, when a queue has a delay and no messages, it's paused most of the time, so the return unless (queue = @polling_strategy.next_queue) guard in dispatch_now is triggered and the slow fetch is never executed. This speeds up the loop so much (and removes the wait time) that it uses 100% CPU, except for the brief moments when the queue is unpaused.

I didn't explore in depth why the old code didn't have this problem, but it's probably because the heartbeat code slowed the main loop down with its 0.1-second execution interval?
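
To sanity-check the theory, here is a tiny self-contained simulation of the re-posting loop (nothing Shoryuken-specific, all names made up); it pegs a core in the same way while nothing is ready to fetch:

require 'concurrent'

executor = Concurrent::SingleThreadExecutor.new
queue_ready = false # stands in for a queue that delay > 0 keeps paused

dispatch = lambda do
  # When a queue is ready, the SQS receive round trip (~0.25 s here)
  # naturally throttles the loop; when everything is paused we skip it.
  sleep 0.25 if queue_ready

  # Re-post immediately, like dispatch_async does via the executor.
  executor.post(&dispatch)
end

executor.post(&dispatch)
sleep 10 # watch top: one core stays close to 100% while queue_ready is false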

phstc pushed a commit that referenced this issue Apr 10, 2017
Fix #348

When delay is specified, dispatch gets into an endless loop (the higher the delay, the worse)
phstc (Collaborator) commented Apr 10, 2017

Hi all

Thanks a lot for the feedback, it helped a lot.

Interestingly enough, if I don't enable logging, my CPU stays around 60% (still high), but if I enable it, it easily goes to 90%. Are you guys using -v?

@nishio-dens @jjoos I combined your feedback and added a delay only when there will be no fetching from SQS (which, as you pointed out, is in a way a delay in itself): https://github.com/phstc/shoryuken/pull/354/files#diff-9fdc6bf30b3f5b4078e1e4d4720765b7R66

WDYT?
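
The gist of the change, as a rough sketch rather than the actual diff (the constant name and surrounding structure are simplified):

def dispatch_now
  if (queue = @polling_strategy.next_queue).nil?
    # Nothing to fetch (all queues paused or empty): pause briefly before
    # trying again instead of spinning through the loop.
    sleep MIN_DISPATCH_INTERVAL
    return
  end

  # A queue is ready: fetch and dispatch right away, with no extra sleep,
  # so busy queues are not slowed down.
  dispatch_batch(queue)
ensure
  dispatch_async
end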

phstc added a commit that referenced this issue Apr 10, 2017
…-avoid-cpu-overload-348

Pause before dispatching to avoid CPU overload

Fix #348
jjoos commented Apr 11, 2017

Interestingly enough, if I don't enable logging, my CPU stays around 60% (still high), but if I enable it, it easily goes to 90%. Are you guys using -v?

Is that for a single core or the whole processor? In my case it takes a complete core without -v.

Solution is fine, thanks for the quick fix!

phstc (Collaborator) commented Apr 11, 2017

Hi @jjoos

I got 60%-90% in a single core.

If you run the latest in production, please let me know how it goes! 🍻

jjoos commented Apr 12, 2017

We just updated to the latest version and changed the delay back from 0 to 1; see the line marking the release in the graph below.
[screenshot, 2017-04-12 4:30 PM: CPU load graph before and after the release]

Everything seems to be working well!

phstc (Collaborator) commented Apr 12, 2017

@jjoos cool, I see. The spikes were when there were no messages in the queues, right?

jjoos commented Apr 13, 2017

Well, besides updating, I also changed the delay setting back from 0 to 1. So not seeing major changes in load is fine!

The spikes are actual jobs being processed on that machine. I'm not sure why the spikes were higher before the release than after; that's probably unrelated...
