job failures not correctly reducing retry count #34

benedictpaten · 2015-04-20T05:33:55Z

Courtesy of Tim Sackton:

I ran into another issue with progressiveCactus: a job that is issued with too little memory to finish gets restarted, fails, and gets restarted again, without the retryCount (apparently) going down at all. So basically we get stuck in a loop where one job is constantly failing.

Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, and after the second and third failures the retry count is still at 5.

Is this a bug or an error in my code/understanding? It could easily be the latter....

Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, as that would be the other reason to get consistent failures, but I don't see how to do that, or even if it is possible.
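
To make the expected behaviour concrete, here is a minimal sketch of what I would expect a failure handler to do: decrement the retry count on every failure, and double the memory request before reissuing. This is not jobTree's actual code. The `Job` class, `handle_failure`, and the `memory_factor` and `max_memory` parameters are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for a scheduled job; not jobTree's real class."""
    command: str
    memory: int             # requested memory in bytes
    remaining_retries: int  # how many more times the job may be reissued


def handle_failure(job: Job, memory_factor: int = 2,
                   max_memory: Optional[int] = None) -> bool:
    """Update a failed job in place.

    Returns True if the job should be reissued, False once retries run out.
    """
    if job.remaining_retries <= 0:
        return False
    # The step that appears to be missing in the log: count down on each failure.
    job.remaining_retries -= 1
    # Ask for more memory on the next attempt, optionally capped.
    new_memory = job.memory * memory_factor
    if max_memory is not None:
        new_memory = min(new_memory, max_memory)
    job.memory = new_memory
    return True


# Simulate a job that fails on every attempt.
GIB = 1024 ** 3
job = Job("cactus_step", memory=2 * GIB, remaining_retries=5)
reissues = 0
while handle_failure(job, max_memory=32 * GIB):
    reissues += 1
```

With a handler like this, a job that always fails is reissued a bounded number of times and then given up on, instead of looping forever. The same pattern could in principle scale a time request as well, if the batch system exposes one.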
