BUG: When restarting a failed task, the task completes but state of job is stuck in running. #115
Can you provide a minimal test case here? The job export functionality can be useful for that. Does this involve any remote tasks?
@thieman. Sure. Say you create a job that has 5 tasks: 1 > 2 > 3 > 4 > 5. You purposely make number 3 fail. The execution stops there; 4 and 5 are not executed. You then fix 3 and hit the "re-run failed tasks" button, and 3 will complete successfully. The state of the overall job remains "running", so I can't restart the job using the "start job from beginning" button. This is definitely a bug. I need to restart the server to force it back to a normal state. There are no remote tasks. How would the export help? -B
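For reference, those steps could also be scripted against the core API rather than clicked through the web UI. A minimal sketch, assuming the `Dagobah`/`Job` objects expose `add_job`, `get_job`, `add_task`, `add_dependency`, `start`, and `retry` roughly as in the core module (the method names and backend wiring here are assumptions, not verified against the release):

```python
# Hedged repro sketch; the Dagobah/Job method names (add_job, get_job,
# add_task, add_dependency, start, retry) and the BaseBackend wiring
# are assumed from the core module, not verified.
from dagobah.backend.base import BaseBackend
from dagobah.core import Dagobah

dagobah = Dagobah(BaseBackend())
dagobah.add_job('repro')
job = dagobah.get_job('repro')

# Five tasks chained 1 > 2 > 3 > 4 > 5; task 3 exits non-zero on purpose.
for i in range(1, 6):
    command = 'exit 1' if i == 3 else 'echo task %d' % i
    job.add_task(command, 'task_%d' % i)
for i in range(1, 5):
    job.add_dependency('task_%d' % i, 'task_%d' % (i + 1))

job.start()   # 1 and 2 succeed, 3 fails, 4 and 5 never run
# ...fix task 3's command, then retry; the bug is that the job's
# state stays 'running' even after task 3 completes:
# job.retry()
```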
Sorry, was just suggesting export as a way of passing along a test job. Will take a look.
Ah, I see. I can whip one up and send it over a little later. It will be later on tonight/tomorrow. Is that OK?
Sure. Sounds simple enough that it may not be necessary, but it will be helpful if I encounter issues.
I experienced the same. Here are simple templates.
I believe the issue arrived with one of my more recent PRs, where we now operate on a copy of the graph. When a task fails, the DAG snapshot gets deleted. When you retry, the snapshot is not reconstructed, and thus after the task finishes, downstream() doesn't operate as expected because the snapshot is gone.

My initial inclination is just to rebuild the snapshot within the retry() method. I'm trying to think if there are any edge cases in which we would want to preserve the snapshot on failure (especially since fixing a failure may mean changing the DAG).
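A sketch of that inclination is below; the attribute and helper names (`snapshot`, `graph`, `tasks`, `_run_from`) are hypothetical stand-ins for whatever the core `Job` class actually uses, since the point is only where the rebuild would happen:

```python
from copy import deepcopy

def retry(self):
    """Restart failed tasks after rebuilding the DAG snapshot.

    Hypothetical sketch: 'snapshot', 'graph', 'tasks', and '_run_from'
    are assumed names. The snapshot is deleted on failure, so without
    rebuilding it here downstream() has nothing to walk once the
    retried task finishes, and the job never leaves 'running'.
    """
    if self.snapshot is None:
        # Re-copy the live graph. Note this deliberately picks up any
        # DAG edits made while fixing the failure, which is the edge
        # case mentioned above.
        self.snapshot = deepcopy(self.graph)
    for task in self.tasks.values():
        if task.status == 'failed':
            self._run_from(task.name)
```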
Should be corrected now, going to push out a release shortly. |
v0.3.1 is live on PyPI with this fix.
I also think the job should continue from where it left off. After the failed task is re-run, it should move on to the next task(s).