SWF Schedules activities to disconnected workers still within 60 second polling window #31
As a test, I performed the following with the workers as outlined above:

```ruby
20.times do
  $rewquest_workflow_client.start_execution(b)
end
```

With the forking workers, many activities are scheduled and "started" but never complete. The activity is a simple passthrough activity, without any real computation. When I change to `:use_forking => false`, I run into no such issues and 100% of the tasks complete, though the "I have not been able to poll successfully..." error is still being logged.
Looks like this is what's causing the error: `task` can be nil if the poll comes back empty, and the code then calls `#activity_type` on a nil object, producing the error you see. The reason it doesn't occur when you turn forking off is probably that there is enough work that the polls aren't returning empty. We'll fix this up and get a patch out as soon as possible. Thanks for the great bug report!
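A minimal sketch of the failure mode and the kind of guard being described, assuming the poller shape shown later in this thread (the method and helper names here are illustrative, not the actual patch):

```ruby
# An empty poll returns nil after the ~60-second long-poll window, and
# calling #activity_type on nil raises the NoMethodError quoted above:
#   undefined method `activity_type' for nil:NilClass
def poll_and_process_once(domain, task_list)
  task = domain.activity_tasks.poll_for_single_task(task_list)
  return false if task.nil?   # guard: empty poll, nothing to process
  process_single_task(task)   # assumed processing entry point
  true
end
```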
@mjsteger this error could be a red herring -- I do get this error with `:use_forking => false` as well. To restate the problem: when using a forking worker, jobs are accepted by the poller but randomly never executed, and they time out.
@mjsteger after more testing with forking enabled, the task log shows tasks being scheduled and started but never completing. The task is just a `puts`.
We're still looking into this issue. Just to be sure we're looking in the right place: you have a simple activity which simply calls `puts`, and a workflow which calls that one activity, correct? Also, the timeouts are probabilistic, correct? If so, can you comment on what percentage you see as failures, on average?
@mjsteger I'm working on a small script that shows the bug. I'll post it here.
@mjsteger after more testing, I've narrowed down the issue and reproduced it multiple times. It seems like the TCP long-polling connection isn't being terminated on the SWF end when an activity worker exits. I can reproduce with these steps (all within the 60-second long-polling window; a code sketch of the steps follows this comment):

1. Start an activity worker, which opens a long poll against the task list.
2. Shut the worker down while that poll is still open.
3. Start a workflow execution that schedules an activity task on the same task list.
SWF seems to hand off the execution to the shut-down activity worker and mark the task as started EVEN THOUGH the worker never actually processes the task. Looking at the debug logs, you will see that the worker never even receives the task token. Is this known behavior? It seems like the worker should be required to ack the token before a task is marked as started?
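A hedged repro sketch of the steps above, using the v1 aws-sdk gem directly (domain and task list names are placeholders; this is an illustration, not the poster's script):

```ruby
require 'aws-sdk'  # v1 SDK, which aws-flow builds on

swf    = AWS::SimpleWorkflow.new
domain = swf.domains['my-domain']   # placeholder domain name

# Step 1: open a long poll, held open on the service side for up to 60s.
poller_pid = fork do
  domain.activity_tasks.poll_for_single_task('my-task-list')
end

# Step 2: kill the worker while its poll is still open.
sleep 2
Process.kill('KILL', poller_pid)
Process.wait(poller_pid)

# Step 3: within the 60-second window, start a workflow execution that
# schedules an activity on 'my-task-list'. SWF may dispatch the task to
# the dead poller and mark it STARTED, where it sits until it times out.
```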
@Jud, thanks a lot for the clear repro. The problem you are seeing happens when a client disconnects while a long poll request is open. In this case the service may not realize the disconnect and dispatch a task, thinking the poller is still there. This is usually not a problem in production, where you have workers running and polling for tasks on a continuous basis, but you may run into it more frequently in testing if you bring pollers up and down quickly. This is not necessarily a problem in the Flow Framework, but we are looking into resolving it. In the meantime you can try using a different task list for each test run, for instance a uuid. This will ensure that tasks for each run go to the new poller.
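A small sketch of the per-run task list suggestion; the worker constructors follow the aws-flow `ActivityWorker`/`WorkflowWorker` pattern, but treat the exact arguments and class names as assumptions:

```ruby
require 'securerandom'

# One fresh task list per test run, shared by the decider and the activity
# worker, so tasks from this run can never land on a stale poller left over
# from a previous run.
run_task_list = "test-run-#{SecureRandom.uuid}"

activity_worker = AWS::Flow::ActivityWorker.new(swf.client, domain, run_task_list, MyActivities)
workflow_worker = AWS::Flow::WorkflowWorker.new(swf.client, domain, run_task_list, MyWorkflow)
```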
@pmohan6 Thanks for the idea of using a different task list, though won't that require coordination between the deciders and the activity workers? I know this isn't really an issue with the framework, and thanks for looking into it for me.
@pmohan6 @mjsteger -- I don't think I agree that this isn't an issue in production. For instance, when we upgrade code and restart our workers, any new executions will time out until the socket is closed. We deploy multiple times a day. Any pointers on how to do seamless activity upgrades that don't require modifying every workflow that uses the activity?
@mjsteger @pmohan6 Would it be possible to work around this issue using timers? E.g. start a 5-second timer when the decider starts an activity, and have the activity cancel the timer? The activity tasks are given something like 200 seconds to complete, so waiting for them to time out isn't an option.
@mjsteger @pmohan6 I realize this might not be the correct forum for this request, but I imagine I'm not the only person who rolls code and needs to restart activity workers (which are subsequently scheduled tasks after they disconnect, requiring the entire timeout period to elapse before a retry). Is there a better solution?
@Jud Sorry about the delay in responding. I would suggest using activity heartbeats in this case so that your activities time out early instead of waiting for the scheduled task to time out. You can set a short enough heartbeat timeout so that when you take your workers down and SWF doesn't receive a heartbeat, it times out the activity task and retries (based on your retry policy). I'm also working on a fix for clean shutdown of activity and workflow workers that waits for the open long poll to finish before shutting down. After this fix, you will just need to do a clean shutdown when you want to restart workers so that there is no 'phantom' dispatch of tasks.
@pmohan6 I see where to set the timeout -- does aws-flow send heartbeats for me, or no? How would I send a heartbeat during a task?
@Jud No, heartbeats aren't sent automatically -- you record them yourself from inside the activity code. You can read more about heartbeats here: http://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-dg-develop-activity.html#swf-dg-managing-activity-tasks. Hope that helps!
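A hedged sketch of what that looks like in aws-flow, based on the linked guide (the activity body, helper, and timeout values are illustrative assumptions):

```ruby
require 'aws/decider'

class MyActivities
  extend AWS::Flow::Activities

  activity :long_task do
    {
      version: '1.0',
      # A short heartbeat timeout lets SWF fail the task quickly when a
      # worker dies, instead of waiting out the full start-to-close timeout.
      default_task_heartbeat_timeout: 30,
      default_task_start_to_close_timeout: 300,
      default_task_list: 'my-task-list'
    }
  end

  def long_task
    10.times do |i|
      do_chunk_of_work(i)  # hypothetical unit of work
      # Report liveness to SWF between chunks of work.
      activity_execution_context.record_activity_heartbeat("chunk #{i}")
    end
  end
end
```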
Yeah, I was running into the exact same issue. What I did was abort (http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Request.html#abort-property) the long poll request before exiting the app. Here's some pseudo code for a Node.js SWF app (it assumes `swf.client` is an AWS SDK SimpleWorkflow client and that `domain` and `log` are defined elsewhere):

```javascript
var interruptCount = 0,
    FORCE_QUIT_LIMIT = 3,
    Task = function Task() {},
    task;

Task.prototype = {
  stopPoll: function stopPoll() {
    if (this.hasOwnProperty('poller')) {
      this.stop = true;
      this.poller.abort(); // close the in-flight long poll
    }
  },
  poll: function poll() {
    var self = this;
    this.poller = swf.client.pollForActivityTask({
      domain: domain,
      taskList: {
        name: 'name'
      },
      identity: 'something'
    });
    this.poller
      .on('error', function onDone(response) {
        // abort() surfaces as an error event; exit only if we asked to stop
        if (self.hasOwnProperty('stop') && self.stop) {
          process.exit();
          return;
        }
      })
      .send();
  }
};

task = new Task();

process.on('SIGINT', function onInterrupt() {
  interruptCount += 1;
  if (interruptCount === FORCE_QUIT_LIMIT) {
    process.exit();
  }
  log('SIGINT! Stopping when possible. ' + (FORCE_QUIT_LIMIT - interruptCount) + ' more to force quit.');
  task.stopPoll();
});
```
@pmohan6 Just wondering about the state of the fix for clean shutdown of long-polling workers you mentioned earlier?
@acant Unfortunately, we don't have a fix for this yet. I'll post an update when we do have one.
I'm running into the same issue using the Node.js SDK... @karvapallo Your solution seemed perfect, and it lowered the number of errors. However, sometimes SWF still schedules the task to an inactive worker. Have you been able to get rid of the errors entirely?
I'm having the same problem described above. In production, whenever I deploy new code, tasks fail. This happens every time I deploy because SWF assigns tasks to dead workers. Retrying isn't an option in my case because the tasks are not idempotent (which is why I am using SWF in the first place; if they were idempotent, I could have just used SQS).
With the way the code is currently structured, a task will be orphaned every time the worker is shut down, as long as there is at least one free fork available when the worker is shut down. The worker is in an infinite loop doing this:

```ruby
loop do
  run_once(false, poller)
end
```

and the important part of `run_once` is:

```ruby
if @shutting_down
  Kernel.exit
end
poller.poll_and_process_single_task(@options.use_forking)
```

where `poll_and_process_single_task` is:

```ruby
def poll_and_process_single_task(use_forking = true)
  begin
    if use_forking
      @executor.block_on_max_workers
    end
    task = @domain.activity_tasks.poll_for_single_task(@task_list)
    if task.nil?
      return false
    end
    if use_forking
      @executor.execute { process_single_task(task) }
    else
      process_single_task(task)
    end
    return true
  end
end
```

This means the main process sits in a long poll whenever at least one fork is free. So let's say you have 1 worker with 2 forks per worker allocated, and your current workload is such that you always have exactly 1 task happening at a time; as soon as that task is completed, another one (and only one) is scheduled. The poll loop will get the first task, which will then start executing in a fork, leaving the main process to start the poll loop again, which will open the next long poll and wait. This will continue correctly until the worker process is shut down. The interrupt handler looks like this:

```ruby
@shutdown_first_time_function = lambda do
  @executor.shutdown Float::INFINITY
  Kernel.exit
end
```

which causes the main process to wait for the executor to finish the currently executing task, and then exit. But remember, the main process was still sitting in a long poll, and there is no attempt to close that long poll here. This leaves the connection open on AWS's end, and AWS will happily allocate and return a task to that long poll as soon as one becomes available. This is not only a problem during testing: it occurs any time a worker is shut down, which happens every time you deploy new code, reboot your server, etc. It is clearly caused by the Flow Framework handling polling incorrectly. As a kludge right now, you can set the heartbeat (or start-to-close) timeout low enough that orphaned tasks fail and get retried quickly.
It looks like the AWS SDK for Ruby doesn't support closing the long poll, which means the worker needs to wait for any current poll to complete, and also complete the returned task, if any, before shutting down.
I think I may have a fix for this. I'm going to try a few things and then I'll open a PR if they work.
I have a fix at #118. I'm pretty confident that it is the right way to fix it. Feel free to leave any questions or feedback there.
Am experiencing the same problem (I think), even with the hello_world sample. It makes sense that the decider is giving tasks to dead workers if you have restarted them with a new code deploy. But then wouldn't the solution when deploying new code be to increase version numbers, so that new activities won't be given to older-version workers? (At least from my understanding so far, that is how SWF works.) But then it is a pain for debugging when you need to stop/start these processes over and over. And if you had a cron job that detects whether a worker is running on a server (or X number of workers) and starts the needed workers in case some crashed, you would have the same issue: tasks would be given to a worker that no longer exists.
My fork takes the approach of letting the worker finish its current task before shutting it down. That way it doesn't shut down with an open long poll to AWS, and AWS won't give it a task after it's gone.
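A hedged sketch of that shutdown pattern (simplified names; this illustrates the approach rather than the actual code in the fork or in #118):

```ruby
# The trap only sets a flag; the worker exits between polls, never mid-poll,
# so no long poll is left open on the service side when the process dies.
shutting_down = false
Signal.trap('INT') { shutting_down = true }

until shutting_down
  # Returns within ~60 seconds, nil on an empty poll.
  task = domain.activity_tasks.poll_for_single_task(task_list)
  process_single_task(task) unless task.nil?
end

# Let any forked children finish their in-flight tasks before exiting.
executor.shutdown(Float::INFINITY)
```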
My activity workers correctly process tasks 60% of the time, then they start throwing this error:

"I have not been able to poll successfully, and am now bailing out, with error undefined method `activity_type' for nil:NilClass"

Possibly something to do with the forking?
My workers look like:
I'm running this worker on ubuntu/EC2.
@mjsteger