
Strict priority polling / bug fix #263

Conversation

@atyndall (Contributor) commented Oct 27, 2016

TL;DR: @mariokostelac's PR #236 restructured Shoryuken in a way that significantly improved performance for our jobs and enabled our use case, so we forked off his PR and implemented Strict Priority Polling similar to that proposed by @gshutler in #106. We also found a bug that was causing serious problems with our very short-running jobs (< 50 ms per job).

We're not entirely sure how to handle merging this, as @mariokostelac's work in #236 has yet to be integrated, but we thought it would be good to make people aware of it.


To describe our Strict Priority Polling implementation: we adapted @gshutler's implementation from #106, mostly preserving its strict-priority properties, while also introducing the ability for queues to pause themselves (via the delay: option) when they find no messages available to process.
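To make the pausing behaviour concrete, here is a minimal, self-contained sketch in plain Ruby. This is not Shoryuken's actual code; the `StrictPriorityChooser` name and its methods are illustrative. It picks the highest-weight queue that isn't currently paused, and pauses a queue for `delay:` seconds when a fetch comes back empty:

```ruby
# Hypothetical sketch (not Shoryuken's real classes): strict-priority
# selection over weighted queues, where an empty queue pauses itself.
class StrictPriorityChooser
  # queues: array of [name, weight] pairs; highest weight = highest priority
  def initialize(queues, delay:)
    @queues = queues.sort_by { |_, weight| -weight }
    @delay = delay
    @paused_until = Hash.new(Time.at(0)) # never paused by default
  end

  # Returns the highest-priority queue that is not currently paused,
  # or nil if every queue is paused.
  def next_queue(now = Time.now)
    pair = @queues.find { |name, _| @paused_until[name] <= now }
    pair && pair.first
  end

  # Called when a fetch from `name` returns no messages.
  def pause(name, now = Time.now)
    @paused_until[name] = now + @delay
  end
end
```

So with queues `[["low", 1], ["high", 2]]`, `next_queue` keeps returning `"high"` until it is paused, then falls back to `"low"`, and returns to `"high"` once the delay elapses.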


To describe the bug: we noticed that after 15-30 minutes of running the queueing system, job processing speed would slow to a crawl, going from 10 jobs a second to less than 1 job a second. This was with a config like this:

timeout: 30     # Seconds grace to shutdown
concurrency: 1  # The number of allocated threads to process messages. Default 25
delay: 30       # The delay in seconds to pause a queue when it's empty. Default 0
queues:
  - [queue_1, 1]

Reviewing the verbose log output, we saw that the following log lines would slowly increase in frequency until they were being logged many times a second, generating many megabytes of log files every minute, filled 90% with these two lines and only occasionally with notification of a new job starting or completing. This effect would quickly render the Shoryuken process ineffective, and it would eventually stop processing work entirely.

DEBUG: Ready: 0, Busy: 1, Active Queues: [["queue_1", 1]]
DEBUG: Pausing fetcher, because all processors are busy

Our working theory is that when Shoryuken::Manager#processor_done calls Shoryuken::Manager#dispatch in quick succession, as happens with our very short-running jobs, the after(1) calls within it schedule #dispatch to run again and again. As the terminating case for these recursive calls is for a processor to become available, and more and more of them become queued, it becomes less and less likely that they will be cleaned up, until eventually they grow to a number that completely chokes the Shoryuken process.

Our fix is to change after(1) { dispatch } to:

@dispatch_timer ||= after(1) do
  @dispatch_timer = nil
  dispatch
end

What this does is ensure that only one #dispatch is ever queued into the future at any one time. In our testing, the problem has not recurred since this change.
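The effect of the guard can be seen in a toy model in plain Ruby. Here `after` is a stub that records pending timer blocks, standing in for Celluloid's real timer API; the `Scheduler` class and method names are illustrative, not Shoryuken's. The unguarded version accumulates one pending timer per call, while the `||=` guard keeps at most one pending at any time:

```ruby
# Toy model (assumption, not Shoryuken code) contrasting unguarded
# timer scheduling with the memoized single-timer guard.
class Scheduler
  def initialize
    @pending = []        # blocks waiting to fire
    @dispatch_timer = nil
  end

  # Stub for Celluloid's after: records the block and returns it.
  def after(_seconds, &block)
    @pending << block
    block
  end

  # Original behaviour: every call queues another future dispatch.
  def schedule_unguarded
    after(1) { dispatch }
  end

  # Fixed behaviour: at most one dispatch timer pending at a time.
  def schedule_guarded
    @dispatch_timer ||= after(1) do
      @dispatch_timer = nil
      dispatch
    end
  end

  def dispatch; end

  def pending_count
    @pending.size
  end
end
```

Calling `schedule_unguarded` 100 times leaves 100 pending timers; calling `schedule_guarded` 100 times leaves exactly 1, which is the property the fix relies on.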

@phstc (Collaborator) commented on the diff, Nov 2, 2016

nice! 🍻

@phstc (Collaborator) commented Nov 2, 2016

@mariokostelac I'm happy with this PR, WDYT? If so, could you fix the merge conflicts in #236 so we can merge it?

@atyndall great work and investigation 🍻

phstc pushed a commit that referenced this pull request Dec 3, 2016
@phstc phstc mentioned this pull request Dec 3, 2016
phstc added a commit that referenced this pull request Dec 3, 2016
@mariokostelac (Contributor) commented
@atyndall It's a great writeup! I think that it would be even more correct to cancel the old timer and start it again (since we know that there are no jobs in the queue). Am I wrong?

@atyndall (Contributor, Author) commented Dec 13, 2016

@mariokostelac I did investigate this approach, and although my memory is a bit hazy at this stage, IIRC it seemed like the act of cancelling the existing timer was quite expensive and slowed things down, compared to letting it run.
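For reference, the cancel-and-reschedule alternative being discussed could be modeled like this, again in plain Ruby with a stub timer class (the `FakeTimer`/`Rescheduler` names and the `cancel` stub are illustrative, not Celluloid's actual API). It also keeps only one timer live, but pays an explicit cancel call on every reschedule, which matches the overhead concern above:

```ruby
# Hypothetical sketch of cancel-and-reschedule: each call cancels the
# previously scheduled timer and starts a fresh one.
class FakeTimer
  attr_reader :cancelled
  def initialize; @cancelled = false; end
  def cancel; @cancelled = true; end
end

class Rescheduler
  attr_reader :timers

  def initialize
    @timers = []
    @dispatch_timer = nil
  end

  # Stub for a timer API: creates and records a new timer.
  def after(_seconds)
    t = FakeTimer.new
    @timers << t
    t
  end

  # Cancel-and-reschedule: only the newest timer stays live, but every
  # call after the first incurs a cancel on the previous timer.
  def schedule_dispatch
    @dispatch_timer.cancel if @dispatch_timer
    @dispatch_timer = after(1)
  end

  def live_count
    @timers.count { |t| !t.cancelled }
  end
end
```

After 10 calls, 10 timers were created but only the last is still live; the other 9 each required a cancel, whereas the `||=` guard in the PR creates only one timer and cancels nothing.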

@phstc (Collaborator) commented Dec 13, 2016

Closing in favor of #284

@phstc (Collaborator) commented Apr 3, 2017

Hi @atyndall

Would you mind reviewing https://github.com/phstc/shoryuken/wiki/Polling-strategies#strictpriority? Feel free to update it 🍻

@atyndall (Contributor, Author) commented Apr 4, 2017

@phstc Added some additional content, looks great!

@phstc (Collaborator) commented Apr 4, 2017

@atyndall awesome! Thanks 🍻
