This repository has been archived by the owner on Jan 10, 2019. It is now read-only.

Starting worker with should_register set to true causes ThrottlingExceptions #82

Open
Tsquare opened this issue Dec 15, 2014 · 9 comments

Tsquare commented Dec 15, 2014

We have tens of workflow types and hundreds of activity types. If we register them on activity/workflow worker startup, one request per type is sent to SWF in rapid succession, causing:

AWS::SimpleWorkflow::Errors::ThrottlingException Rate exceeded

There should be a way to avoid registering already-registered types (which is most of them).

Contributor

pmohan6 commented Dec 19, 2014

Flow could potentially do this in three ways:

  1. Try to register the workflow/activity type and, if it fails with TypeAlreadyExistsFault, continue on (this is what Flow does currently).
  2. Call describe_workflow/activity_type on each type to see whether it exists, and register it only if it does not. This doesn't really solve the problem, because it shifts the load from Register to Describe.
  3. Call list_workflow/activity_types on each domain. While this could potentially reduce the number of calls, it is not guaranteed, since List is a best-effort call. Moreover, if the domain already has a lot of types, the call has to page through to get the entire list. And if you have a lot of workers starting up, you are likely to get throttled on this too.
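As a rough illustration of option 1, the sketch below registers every type and treats an "already exists" failure as success. It does not use the AWS SDK: `FakeSwfClient`, its `register_activity_type(name:, version:)` signature, and the locally defined `TypeAlreadyExistsFault` class are all hypothetical stand-ins so the example runs on its own. Note that it still issues one call per type, which is what triggers the throttling described above.

```ruby
# Stand-in for the service's "type already exists" error. Hypothetical:
# defined locally so the sketch runs without the AWS SDK.
class TypeAlreadyExistsFault < StandardError; end

# Hypothetical in-memory client that mimics register semantics.
class FakeSwfClient
  attr_reader :calls

  def initialize(existing)
    @registered = existing.dup
    @calls = 0
  end

  def register_activity_type(name:, version:)
    @calls += 1
    key = [name, version]
    raise TypeAlreadyExistsFault, "#{name} #{version}" if @registered.include?(key)
    @registered << key
  end
end

# Option 1: always attempt to register; "already exists" counts as success.
def register_all(client, types)
  newly_registered = []
  types.each do |name, version|
    begin
      client.register_activity_type(name: name, version: version)
      newly_registered << [name, version]
    rescue TypeAlreadyExistsFault
      # The type was registered earlier; nothing to do.
    end
  end
  newly_registered
end

client = FakeSwfClient.new([["resize", "1.0"]])
types = [["resize", "1.0"], ["upload", "1.0"]]
created = register_all(client, types)
```

Here `created` holds only the type that was actually new, while `client.calls` equals the total number of types.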

We currently recommend that users start their workers in a staggered way instead of all at once.
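One possible way to stagger startup is to spread worker start times evenly over a time window. The helper below is only a sketch: `staggered_delays` and its parameters are hypothetical names, not part of Flow, and the right window length depends on your own throttling limits.

```ruby
# Hypothetical helper: spread worker start times evenly across a window.
# Returns the delay in seconds to wait before starting each worker.
def staggered_delays(worker_count, window_seconds)
  return [] if worker_count <= 0
  step = window_seconds.to_f / worker_count
  (0...worker_count).map { |i| (i * step).round(2) }
end

delays = staggered_delays(4, 10)
```

A launcher could then sleep for each delay before starting the corresponding worker, so that registration calls from different workers do not arrive at the same moment.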

You can also request a limit increase from SWF here.

Author

Tsquare commented Dec 19, 2014

Can't you fetch all the known activity types using a single call:

http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/SimpleWorkflow/Domain.html#activity_types-instance_method?

Contributor

pmohan6 commented Dec 19, 2014

Domain#activity_types is just a Ruby SDK abstraction that sits on top of the SWF client. It calls #list_activity_types internally and pages through all the results to return the entire list.

Author

Tsquare commented Dec 19, 2014

My reading of http://docs.aws.amazon.com/amazonswf/latest/apireference/API_ListActivityTypes.html is that it returns all the types if you don't specify a name. Is that wrong?

Contributor

pmohan6 commented Dec 19, 2014

Right, it returns all the types if you don't specify a name. What I meant to say was that if the list is large enough, it will make multiple calls to page through the entire list. List is also a slower and potentially more expensive call than Register. It is difficult to predict which one will cost customers less, since it depends on the usage scenario. In our experience, the register approach seems to come out cheaper.

Author

Tsquare commented Dec 19, 2014

But it sounds like each page returns up to 100 types, so instead of 100 register calls, you'd have a single list call, followed by checking which types are still unregistered and only registering those (and so on for each page). In terms of the number of calls, that seems like a large savings in the typical case (in which a worker restarts and most of the types are already registered).
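A minimal sketch of this list-then-register idea follows. It is not Flow's implementation and does not call the AWS SDK: `FakePagedClient`, its method signatures, and the offset-based page token are hypothetical stand-ins chosen so the example is runnable. For simplicity it reads all pages first and then registers the missing types, rather than interleaving the two per page.

```ruby
require "set"

# Hypothetical paged client; the page token is just an offset here.
class FakePagedClient
  attr_reader :list_calls, :register_calls

  def initialize(existing, page_size)
    @existing = existing.dup
    @page_size = page_size
    @list_calls = 0
    @register_calls = 0
  end

  # Returns one page of types plus a token when more pages remain.
  def list_activity_types(next_page_token: nil)
    @list_calls += 1
    offset = next_page_token || 0
    page = @existing[offset, @page_size] || []
    more = offset + @page_size < @existing.size
    { type_infos: page, next_page_token: more ? offset + @page_size : nil }
  end

  def register_activity_type(name:, version:)
    @register_calls += 1
    @existing << [name, version]
  end
end

# List every page, then register only the types that were not listed.
def register_missing(client, wanted)
  known = Set.new
  token = nil
  loop do
    page = client.list_activity_types(next_page_token: token)
    known.merge(page[:type_infos])
    token = page[:next_page_token]
    break if token.nil?
  end
  missing = wanted.reject { |type| known.include?(type) }
  missing.each do |name, version|
    client.register_activity_type(name: name, version: version)
  end
  missing
end

existing = (1..250).map { |i| ["activity_#{i}", "1.0"] }
client = FakePagedClient.new(existing, 100)
wanted = existing + [["activity_new", "1.0"]]
missing = register_missing(client, wanted)
```

With 250 existing types, a page size of 100, and one new type, this stand-in makes 3 list calls and 1 register call instead of 251 register calls. Because List is described above as best effort, a real implementation would presumably still need to tolerate an "already exists" failure on register.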

Contributor

pmohan6 commented Jan 14, 2015

We will explore using List before trying to register and look at the cost difference between the two methods.

Contributor

pmohan6 commented Jan 23, 2015

We have changed the implementation of the runner to register only using the first worker for each set of workers. This should reduce the number of register calls by a factor of number_of_workers if you are using the runner to start your workers.
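The effect of that change can be sketched as follows. This is an illustration only, not the runner's actual code: `SketchWorker` and `start_worker_set` are hypothetical names, and the worker merely counts the register calls it would make. The `should_register` argument mirrors the flag named in the issue title.

```ruby
# Hypothetical worker stand-in; it only counts the register calls it would make.
class SketchWorker
  attr_reader :register_calls

  def initialize(type_count)
    @type_count = type_count
    @register_calls = 0
  end

  # One register call per type, but only when should_register is true.
  def start(should_register)
    @register_calls += @type_count if should_register
  end
end

# Only the first worker in the set registers types.
def start_worker_set(workers)
  workers.each_with_index do |worker, index|
    worker.start(index.zero?)
  end
end

workers = Array.new(5) { SketchWorker.new(100) }
start_worker_set(workers)
total_calls = workers.map(&:register_calls).reduce(0, :+)
```

With five workers and 100 types, this gives 100 register calls in total rather than 500. The first worker still sends one call per type in quick succession.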

Closing the issue for now but please feel free to reopen if you think this doesn't solve the issue. Thanks!

396dd80

pmohan6 closed this as completed Jan 23, 2015
pmohan6 referenced this issue Jan 23, 2015
… worker in the runner and a couple of minor bug fixes
Author

Tsquare commented Jan 23, 2015

We've already implemented this fix ourselves and unfortunately it doesn't resolve the issue. (I can't reopen since I'm not a collaborator on this repo.)

pmohan6 reopened this Jan 23, 2015