
Review possibility to re-submit jobs in TPV #71

Open
martindemko opened this issue Jan 24, 2025 · 0 comments

martindemko commented Jan 24, 2025

I'm copying a short conversation from the TPV channel on Matrix:

Martin Demko:

Hi guys, a year ago (or so) I tried to make use of resubmission in TPV, but before I could test it I found information that it's supported only for scheduling to Slurm. We are using PBS through DRMAA, so I stopped. Now I've decided to give it another chance and am willing to extend the code to support PBS, or rather DRMAA in general, but I'm unable to find the note about Slurm-only support. Is it possible that this has changed? Is anybody here using resubmission with something other than Slurm, please?

M. Bernt:

Hi, it seems not to work for dynamic destinations in general 😢 galaxyproject/galaxy#9747. More precisely, only one resubmission seems possible. If I recall correctly, the problem is that we do not store the necessary information in the DB. In the linked PR I started to work on a test case, but never finished.

Björn Grüning:

One resubmission per destination, maybe? sorting-hat, a predecessor of TPV, could resubmit a job multiple times: https://github.com/usegalaxy-eu/sorting-hat/blob/0b0758a1b8b72bc0ea5ae198ad2949f4aa16b586/sorting_hat.py#L547. See also: https://github.com/usegalaxy-eu/sorting-hat/blob/0b0758a1b8b72bc0ea5ae198ad2949f4aa16b586/sorting_hat.py#L460

Nuwan Goonasekera:

My recollection is that resubmission works as long as the destination name is unique. (Notice that sorting-hat is using gateway1x, gateway2x, etc.) Therefore, resubmitting from TPV to a destination with a unique name should work. However, what doesn't work, and I believe M. Bernt tracked this down to a specific line doing destination caching, is resubmitting a dynamic destination to itself. That is, a TPV resubmission cannot be handled by TPV again because the final destination is cached. Therefore, a workaround could be (I haven't tried this) to define two (or more) different TPV destinations with unique names but the same configuration, and have one TPV destination resubmit to the other.
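(Editor's illustration.) The untested workaround described above, several destinations with unique names and identical configuration that resubmit to one another in turn, can be sketched as plain data. This is only a minimal sketch: the function name `build_resubmit_chain`, the `resubmit_to` key, and the `tpv_dispatcher_` prefix are hypothetical and are not TPV's or Galaxy's actual configuration schema.

```python
from copy import deepcopy


def build_resubmit_chain(base_name, shared_config, attempts):
    """Return {destination_name: config}; each entry points at the next one.

    Hypothetical helper: it only models the naming idea, it does not
    generate a real TPV or Galaxy job configuration.
    """
    names = [f"{base_name}{i}" for i in range(1, attempts + 1)]
    chain = {}
    for idx, name in enumerate(names):
        # Every destination gets its own copy of the same configuration.
        config = deepcopy(shared_config)
        # The last destination has nowhere left to resubmit to.
        config["resubmit_to"] = names[idx + 1] if idx + 1 < len(names) else None
        chain[name] = config
    return chain


chain = build_resubmit_chain("tpv_dispatcher_", {"runner": "dynamic"}, 3)
for name, config in chain.items():
    print(name, "->", config["resubmit_to"])
```

Because every name is distinct, no destination ever resubmits to itself, which is the case reported not to work in this thread.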

Björn Grüning:

If we identify this as a problem, one of us should join the backend-wg and raise this issue so it gets onto their to-do list.
For the time being, we could use the same workaround as we did in sorting-hat.

cat-bro:

Like M. Bernt, I have found that resubmission will work exactly once and that anything I have set in TPV for the job will not be accessible the second time around. I have a PR in the Aus infrastructure repository, though I'm not quite sure about my implementation: usegalaxy-au/infrastructure#2328
With regard to what Martin D was saying about resubmission and Slurm: I think it's possible that the variable memory_limit_reached is only populated if the Slurm job runner has been used, but other conditions could be checked instead (for example, looking for the word Killed in job.tool_stderr).
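(Editor's illustration.) The alternative condition suggested above, looking for the word Killed in the job's stderr, could look roughly like the sketch below. It assumes the stderr text is available as a plain string (as job.tool_stderr is described in the message above); the function name `stderr_suggests_kill` and the `patterns` parameter are hypothetical, and a match is only a heuristic, not proof that the scheduler killed the job for exceeding memory.

```python
import re
from typing import Iterable, Optional


def stderr_suggests_kill(stderr: Optional[str],
                         patterns: Iterable[str] = ("Killed",)) -> bool:
    """Return True if any pattern occurs as a whole word in stderr.

    Hypothetical helper for illustration; empty or missing stderr
    never matches.
    """
    if not stderr:
        return False
    for pattern in patterns:
        # Whole-word, case-sensitive match so "Killed" does not match "Killedx".
        if re.search(r"\b" + re.escape(pattern) + r"\b", stderr):
            return True
    return False


sample = "/bin/bash: line 1: 12345 Killed  python run.py"
print(stderr_suggests_kill(sample))
```

Such a check is runner-independent, so it could serve where memory_limit_reached is not populated, at the cost of possible false positives.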

M. Bernt:

Thanks, cat-bro. Indeed, there are some runner-specific things that I forgot about:

If this is correct, it should be pretty simple to add for PBS.
