I ran into another issue with progressiveCactus: a job that is issued with too little memory to finish gets restarted, fails, and gets restarted again, without the retryCount (apparently) going down at all. So we end up stuck in a loop where one job is constantly failing.
Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, and after the second and third failures the retry count is still at 5.
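To make the expectation concrete, here is a minimal sketch of the behavior I assumed. It is not jobTree's actual code: `run_with_retries` and `run_job` are hypothetical names, and the starting value of 5 is just taken from the log. Each failure uses up one retry, so a job that always fails is given up on after a fixed number of attempts instead of looping forever.

```python
def run_with_retries(run_job, retry_count=5):
    """Call run_job() until it succeeds or the retry budget is spent.

    run_job is a hypothetical callable that returns True on success.
    Returns a (succeeded, retries_left) tuple.
    """
    remaining = retry_count
    while True:
        if run_job():
            return True, remaining
        if remaining == 0:
            # Budget exhausted: report the failure instead of reissuing.
            return False, 0
        remaining -= 1  # every failure consumes exactly one retry
```

With this logic a job that never succeeds runs six times in total (the first attempt plus five retries) and then stops.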
Is this a bug or an error in my code/understanding? It could easily be the latter....
Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the job requests each time it fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, since running out of time would be the other cause of consistent failures, but I don't see how to do that, or even whether it is possible.
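Roughly, the memory-doubling idea looks like the sketch below. This is an illustration rather than the code at the link: the function names are hypothetical, and the upper limit (`max_bytes`) is an assumption I added so the request cannot grow past what a node can provide.

```python
def next_memory_request(current_bytes, max_bytes):
    """Double the memory request after a failure, capped at max_bytes."""
    return min(current_bytes * 2, max_bytes)


def memory_schedule(initial_bytes, max_bytes, retry_count):
    """List the memory requested on the first attempt and on each retry."""
    schedule = [initial_bytes]
    for _ in range(retry_count):
        schedule.append(next_memory_request(schedule[-1], max_bytes))
    return schedule
```

For example, starting at 2 GiB with a 16 GiB cap and 5 retries, the requests would be 2, 4, 8, 16, 16, and 16 GiB. The same pattern could in principle apply to a time limit, if the batch system exposes one.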
Courtesy of Tim Sackton.