
Manager occasionally dying #428

Closed
ethangunderson opened this issue Aug 28, 2017 · 5 comments

Comments

@ethangunderson

We're running into a problem where ~10 times a month, a shoryuken manager will die. The logs around the shutdown look like this:

Manager failed: No Content-Type received. Service returned the HTTP status code: 500

ERROR: /var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/plugins/raise_response_errors.rb:15:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/aws-sdk-core/plugins/idempotency_token.rb:18:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/aws-sdk-core/plugins/param_converter.rb:20:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/plugins/response_target.rb:21:in `call'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/request.rb:70:in `send_request'
/var/app/current/vendor/bundle/gems/aws-sdk-core-2.7.7/lib/seahorse/client/base.rb:207:in `block (2 levels) in define_operation_methods'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/queue.rb:43:in `receive_messages'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/fetcher.rb:35:in `receive_messages'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/fetcher.rb:16:in `fetch'
/var/app/current/vendor/bundle/gems/shoryuken-3.1.7/lib/shoryuken/manager.rb:82:in `dispatch_single_messages'

Received USR1, will soft shutdown down

The error message seems to indicate that SQS returned a 500 when the manager attempted to fetch messages. That exception is caught in the manager's dispatch code, which results in a shutdown.
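The failure path described above can be sketched roughly as follows; the class and method names here are illustrative stand-ins, not Shoryuken's actual internals:

```ruby
# Illustrative sketch only: a fetch error bubbling up into a manager shutdown.
class FakeFetcher
  def fetch(_queue, _limit)
    # Simulates the SQS 500 seen in the logs above
    raise "No Content-Type received. Service returned the HTTP status code: 500"
  end
end

class Manager
  attr_reader :running

  def initialize(fetcher)
    @fetcher = fetcher
    @running = true
  end

  def dispatch
    @fetcher.fetch("default", 10)
  rescue => e
    # An unrescued SQS error reaches this handler...
    handle_failure(e)
  end

  def handle_failure(error)
    warn "Manager failed: #{error.message}"
    @running = false # ...and the manager shuts itself down
  end
end
```

A single transient 500 from SQS is enough to take the whole manager down, which matches the behavior reported in the logs.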

Does that seem right? Is that the right behavior, or am I missing something?

@phstc
Collaborator

phstc commented Aug 28, 2017

Hi @ethangunderson

Before #369, Shoryuken used to continuously retry fetch failures, but that was also causing problems in cases where a retry would never succeed - for example, if the instance permanently loses its connection.

Maybe we could try some exponential backoff:

def fetch(queue, limit)
  retries ||= 0
  # fetch code
rescue Aws::Errors::ServiceError # need to confirm the base error class name
  retries += 1
  if retries <= 10
    sleep(retries) # linear delay for now; could grow exponentially
    retry
  else
    raise
  end
end

Would a retry after a few seconds work in your case?
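For reference, a capped exponential backoff along those lines might look like the sketch below; the helper names are hypothetical, and the base error class would still need confirming against aws-sdk-core:

```ruby
# Hypothetical sketch: capped exponential backoff around a fetch call.
MAX_FETCH_RETRIES = 10

# The delay doubles with each attempt, capped so a long outage
# doesn't produce multi-minute sleeps.
def backoff_delay(attempt, cap: 30)
  [2**attempt, cap].min
end

def fetch_with_backoff(max_retries: MAX_FETCH_RETRIES)
  attempts = 0
  begin
    yield # the actual SQS receive_messages call would go here
  rescue StandardError
    attempts += 1
    raise if attempts > max_retries # budget exhausted, re-raise as today
    sleep(backoff_delay(attempts))
    retry
  end
end
```

A call like `fetch_with_backoff { queue.receive_messages(options) }` would then ride out transient 500s and still re-raise once the retry budget is exhausted, preserving the fix from #369 for permanent failures.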

@ethangunderson
Author

Thanks for clarifying.

Yep, a retry like that would work fine for us. We're running shoryuken on ~12 servers total right now, and we've only seen the issue on one server at a time.

@phstc
Collaborator

phstc commented Aug 28, 2017

@ethangunderson do you know what exception was thrown, so I can create a more specific rescue?

@ethangunderson
Author

Manager failed: No Content-Type received. Service returned the HTTP status code: 500

That's all I have in the logs.

@phstc
Collaborator

phstc commented Sep 2, 2017

@ethangunderson I've just released 3.1.11, could you try it out? This version will auto retry fetch errors (up to 3 times).
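The released behavior, as described, amounts to something like this hypothetical wrapper; the real implementation lives in Shoryuken's fetcher and may differ in detail:

```ruby
# Hypothetical illustration of "auto retry fetch errors (up to 3 times)":
# one initial attempt plus at most three retries before giving up.
def with_fetch_retries(max_retries = 3)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    retry if attempts <= max_retries
    raise # exhausted: fall back to the old fail-and-shutdown behavior
  end
end
```

Under these semantics, an isolated SQS 500 like the one in the original report would be absorbed by a retry instead of killing the manager.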
