
Manager occasionally dying #428

Closed
ethangunderson opened this issue Aug 28, 2017 · 5 comments

Comments

@ethangunderson

We're running into a problem where ~10 times a month, a shoryuken manager will die. The logs around the shutdown look like this:

Manager failed: No Content-Type received. Service returned the HTTP status code: 500

ERROR: /var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/aws-sdk-core/plugins/idempotency_token.rb:18:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/aws-sdk-core/plugins/param_converter.rb:20:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/plugins/response_target.rb:21:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/request.rb:70:in `send_request'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/base.rb:207:in `block (2 levels) in define_operation_methods'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/queue.rb:43:in `receive_messages'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/fetcher.rb:35:in `receive_messages'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/fetcher.rb:16:in `fetch'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/manager.rb:82:in `dispatch_single_messages'

Received USR1, will soft shutdown down

The error message seems to indicate that SQS returned a 500 when the manager attempted to fetch messages. That exception is caught in the manager's dispatch code, which results in a shutdown.
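The failure path described above can be sketched roughly as follows; the class and method names here are illustrative stand-ins, not Shoryuken's actual internals:

```ruby
# Illustrative sketch only: a fetch error bubbling up into a manager shutdown.
class FakeFetcher
  def fetch(_queue, _limit)
    # Simulates the SQS 500 seen in the logs above
    raise "No Content-Type received. Service returned the HTTP status code: 500"
  end
end

class Manager
  attr_reader :running

  def initialize(fetcher)
    @fetcher = fetcher
    @running = true
  end

  def dispatch
    @fetcher.fetch("default", 10)
  rescue => e
    # An unrescued SQS error reaches this handler...
    handle_failure(e)
  end

  def handle_failure(error)
    warn "Manager failed: #{error.message}"
    @running = false # ...and the manager shuts itself down
  end
end
```

A single transient 500 from SQS is enough to take the whole manager down, which matches the behavior reported in the logs.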

Does that seem right? Is that the right behavior, or am I missing something?

@phstc
Collaborator

phstc commented Aug 28, 2017

Hi @ethangunderson

Before #369, Shoryuken used to continuously retry fetch failures, but that was also causing problems in cases where a retry would never succeed - for example, if the instance permanently loses its connection.

Maybe we could try some exponential backoff:

def fetch(queue, limit)
  retries ||= 0
  # fetch code
rescue Aws::Errors::ServiceError # need to confirm the base error class name
  retries += 1
  if retries <= 10
    sleep(retries) # linear delay for now; could grow exponentially
    retry
  else
    raise
  end
end

Would a retry after a few seconds work in your case?
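For reference, a capped exponential backoff along those lines might look like the sketch below; the helper names are hypothetical, and the base error class would still need confirming against aws-sdk-core:

```ruby
# Hypothetical sketch: capped exponential backoff around a fetch call.
MAX_FETCH_RETRIES = 10

# The delay doubles with each attempt, capped so a long outage
# doesn't produce multi-minute sleeps.
def backoff_delay(attempt, cap: 30)
  [2**attempt, cap].min
end

def fetch_with_backoff(max_retries: MAX_FETCH_RETRIES)
  attempts = 0
  begin
    yield # the actual SQS receive_messages call would go here
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries # budget exhausted, re-raise as today
    sleep(backoff_delay(attempts))
    retry
  end
end
```

A call like `fetch_with_backoff { queue.receive_messages(options) }` would then ride out transient 500s and still re-raise once the retry budget is exhausted, preserving the fix from #369 for permanent failures.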

@ethangunderson
Author

Thanks for clarifying.

Yep, a retry like that would work fine for us. We're running shoryuken on ~12 servers total right now, and we've only seen the issue on one server at a time.

@phstc
Collaborator

phstc commented Aug 28, 2017

@ethangunderson do you know what exception was thrown, so I can create a more specific rescue?

@ethangunderson
Author

Manager failed: No Content-Type received. Service returned the HTTP status code: 500

That's all I have in the logs.

@phstc
Collaborator

phstc commented Sep 2, 2017

@ethangunderson I've just released 3.1.11, could you try it out? This version will auto retry fetch errors (up to 3 times).
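The released behavior, as described, amounts to something like this hypothetical wrapper; the real implementation lives in Shoryuken's fetcher and may differ in detail:

```ruby
# Hypothetical illustration of "auto retry fetch errors (up to 3 times)":
# one initial attempt plus at most three retries before giving up.
def with_fetch_retries(max_retries = 3)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    retry if attempts <= max_retries
    raise # exhausted: fall back to the old fail-and-shutdown behavior
  end
end
```

Under these semantics, an isolated SQS 500 like the one in the original report would be absorbed by a retry instead of killing the manager.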
