
Wait_finished method for job API (regarding #240) #242

Merged · 9 commits · Sep 9, 2022

Conversation

JonaOtto (Contributor)

Hello PySlurm developers,
This is my proposal for a method to wait for a job to finish, regarding #240, for the 21.08 version.

I added the wait_finished method to the job class. It is implemented using the find_id method of the job class, and I added a test method as well. I tested this in the container, as mentioned in the README. At least the test for wait_finished passes (some others do not, but those also fail without my changes). Sadly, it turns out the cluster I have access to does not run a Slurm version I can use with PySlurm, so I could not test it on a real cluster (but we will get a Slurm update in the coming weeks, so maybe I will try again with a newer version).

I'm looking forward to your thoughts and comments!

Best,
Jonathan

tazend self-requested a review on Aug 11, 2022, 16:53
tazend (Member) commented Aug 11, 2022

Hi @JonaOtto

Thanks for the addition. I think instead of merging into the 21.08 branch, it would be better to merge directly into main. After that, I can simply backport it to the 21.08 branch (and maybe even older branches). I'd say it's better this way around.

main is 22.05, though from looking at the code, it should work just as in 21.08. I have a 22.05 installation, so I can test it.

tazend (Member) commented Aug 11, 2022

Hi @JonaOtto

just had another look at it. I think there might be a small problem with the logic, though.

The problem is that the only condition tested for here is the COMPLETED state. A job that has finished is, however, not always in the COMPLETED state; there are several different ending states. One of them, for example, is TIMEOUT.

If you call wait_finished on a job that lands in the TIMEOUT state, the function will keep blocking even though the job has technically ended (it is in a finished state), until the job is purged from the Slurm controller's internal memory (a configurable time window, 300 seconds by default, I think). Once the job is purged, the function yields an error: the state is still TIMEOUT != COMPLETED, but the job no longer exists, so find_id raises a ValueError (which might be fine, though in this case it would occur solely because the function failed to detect that the job had actually ended).
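
To illustrate the range of terminal states, here is a plain-Python sketch (the state names are the base job states from slurm.h; the helper itself is illustrative, not the fix proposed below):

# Base job states that count as "finished" in slurm.h (everything
# past JOB_SUSPENDED), not just COMPLETED:
FINISHED_STATES = {
    "COMPLETED", "CANCELLED", "FAILED", "TIMEOUT",
    "NODE_FAIL", "PREEMPTED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
}

def is_finished(job_state):
    # e.g. is_finished("TIMEOUT") -> True, even though it is not COMPLETED
    return job_state in FINISHED_STATES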

Instead, it would be more beneficial to work with the raw job pointer, and therefore the raw state info. I think we just need to extract the slurm_load_job logic from find_id into a separate internal function that simply loads the job and gives access to the raw job pointer, instead of the parsed dict. So basically this (not complete):

cdef _load_single_job(self, jobid):
    cdef:
        int apiError
        int rc

    # Accept the job id as int or str.
    if isinstance(jobid, int) or isinstance(jobid, long):
        jobid = str(jobid).encode("UTF-8")
    else:
        jobid = jobid.encode("UTF-8")

    jobid_xlate = slurm.slurm_xlate_job_id(jobid)
    rc = slurm.slurm_load_job(&self._job_ptr, jobid_xlate, self._ShowFlags)

    if rc != slurm.SLURM_SUCCESS:
        # Error handling sketched after the pattern used in find_id.
        apiError = slurm.slurm_get_errno()
        raise ValueError(slurm.slurm_strerror(apiError), apiError)

With the pointer, we can then do:

while not IS_JOB_FINISHED(&self._job_ptr.job_array[0]):
    ...

The IS_JOB_FINISHED macro correctly determines whether a job is in a finished state.
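
For context, a fuller polling loop along these lines could look as follows. This is just a sketch: it assumes a _load_single_job method as above, that the time module is imported, and that the IS_JOB_FINISHED macro from slurm.h is declared in the Cython definitions; the 5-second interval is illustrative.

cdef wait_finished(self, jobid):
    # Poll until the (first) job record reaches any finished state.
    self._load_single_job(jobid)
    while not IS_JOB_FINISHED(&self._job_ptr.job_array[0]):
        # Free the message before reloading, so repeated polls do not
        # leak the memory allocated by slurm_load_job.
        slurm.slurm_free_job_info_msg(self._job_ptr)
        time.sleep(5)
        self._load_single_job(jobid)
    slurm.slurm_free_job_info_msg(self._job_ptr)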

What do you think?

JonaOtto (Contributor, Author) commented Aug 11, 2022

Hi @tazend,
Yeah, I actually tested a similar implementation yesterday, which used the IS_JOB_COMPLETED function. But to get to the pointer, I had to reuse a lot of the logic from find_id (like the snippet you wrote above). And since I did not know about these different finished states, I did not catch how this differed from doing it the "Python" way and thought there was no difference. What you say sounds good to me, and it is probably a good idea to extract the slurm_load_job functionality into a __load_single_job or similar 👍 That way it should work correctly without duplicating code. I will correct it and update the branch/PR accordingly.

EDIT: typo

JonaOtto (Contributor, Author)

Hello pyslurm devs,
I have pushed my updates for the wait_finished method. It now uses the raw job pointer, as discussed before.
I also did two additional things:

  • Using the [0] index doesn't really make sense with the job pointer, I guess, since for a job array it would only wait for the first job in the array. I extended the method to also work with job arrays and added a new test method for this.
  • Because I thought it convenient, and to match the semantics of sbatch --wait, I made wait_finished return the exit code instead of None. For a job array, it returns the highest exit code across all jobs in the array (as sbatch --wait does); see the sketch after this list.
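
A rough sketch of that aggregation over the raw pointer (record_count and job_array are the fields of job_info_msg_t; decoding the wait(2)-style exit_code with a shift is my assumption, not necessarily what the PR does):

# Highest exit status across all records (array tasks) in the loaded
# message, mirroring the value sbatch --wait reports.
exit_code = 0
for i in range(self._job_ptr.record_count):
    # Slurm stores exit_code wait(2)-encoded; the upper byte is the status.
    status = (self._job_ptr.job_array[i].exit_code >> 8) & 0xFF
    exit_code = max(exit_code, status)
return exit_code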

I tested this in the container again. As before, the tests for wait_finished pass while some others do not. Since we actually got an update to Slurm 21.08 on our cluster yesterday, I ran some example test scripts that use pyslurm with the wait_finished (and other) methods the way I intend to, and everything seems to work for me. I tested multiple configurations: single jobs, arrays, and zero and non-zero exit codes.

Let me know what you think :)
Best regards,
Jonathan

tazend (Member) commented Sep 4, 2022

Hi @JonaOtto

Except for the one thing I mentioned about the potential memory leak, which needs to be fixed, everything else looks good.
Additionally, I'd like to have this go directly into the main branch (instead of 21.08); from there I can backport it to the 21.08 branch (it seems better this way around). I think the target branch of the PR can be changed, and it should still work.

JonaOtto changed the base branch from 21.08 to main on Sep 7, 2022, 08:41
JonaOtto (Contributor, Author) commented Sep 7, 2022

Hi @tazend,
Sure, the base branch is changed. If this doesn't work for some reason, let me know; I could copy my changes onto main and open a new PR (I think source branches cannot be changed).
About the potential memory leak: I'm sorry, but I'm not sure which mention you are referring to. I guess I'm missing something, but I cannot work out what it is at the moment. Should I add slurm.slurm_free_job_info_msg(self._job_ptr) at the end of wait_finished?

tazend (Member) commented Sep 8, 2022

Hi @JonaOtto

I mentioned the memory leak in line 2921 (I hope my comment shows up there as a review?).

Yes, after line 2934 (still within the while loop, but after the for loop) you need to add slurm.slurm_free_job_info_msg(self._job_ptr).
Without it, each new check of whether the job has finished leaks more memory, because the old job pointer is never cleaned up during the while loop.
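
Illustratively, the placement would be as follows (the exact loop shape in the PR is assumed here; only the position of the free call matters):

while not all_finished:
    self._load_single_job(jobid)
    for i in range(self._job_ptr.record_count):
        ...  # check IS_JOB_FINISHED for each record
    # Still inside the while loop, after the for loop: free the message
    # so repeated polls do not leak the slurm_load_job allocation.
    slurm.slurm_free_job_info_msg(self._job_ptr)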

JonaOtto (Contributor, Author) commented Sep 8, 2022

Hi @tazend

Oh okay, maybe that's on me; I'm still a noob when it comes to all these cool GitHub features. I don't see your comment, but maybe I didn't know where to look.

Anyway, we were thinking about the same thing; give me a minute and I will do the fix.

EDIT: typo

JonaOtto (Contributor, Author) commented Sep 8, 2022

Okay, fix pushed. Every call to _load_single_job should now be paired with a slurm_free_job_info_msg. Thanks for all your help @tazend!

tazend (Member) commented Sep 9, 2022

@JonaOtto

Alright, everything looks good!
Yeah, it was my fault: I added the comment but didn't actually submit the review, so it didn't show up for you.

Anyway, we can merge it now. Thanks for the contribution :)

tazend merged commit f110bf5 into PySlurm:main on Sep 9, 2022
JonaOtto (Contributor, Author)

Thanks for merging :)

tazend pushed a commit that referenced this pull request Sep 11, 2022
* Fix introduced typo in partition information dictionary key. (#241)

* Added wait_finished method to job class (#240).

* Added test method for wait_finished method of the job class.

* Added _load_single_job method to the job class to extract the slurm_load_job functionality.

* Updated find_id and wait_finished to use _load_single_job.

Co-authored-by: Jonathan Goodson <[email protected]>