Submitting batch job fails randomly with broken paths #260

Closed
Sideboard opened this issue Feb 7, 2023 · 9 comments

Comments

@Sideboard

Sideboard commented Feb 7, 2023

Details

  • Slurm Version: 20.11.9
  • Python Version: 3.9.6
  • Cython Version: 0.29.33
  • PySlurm Branch: tag: v20.11.8-1
  • Linux Distribution: CentOS Linux 7 (Core)

Issue

Submitting jobs (either via script or wrap) fails randomly. An immediate indicator is that the work_dir (and other paths like std_out and std_err) are broken strings in those cases:

>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727734 PENDING /correct/work/dir
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727735 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727736 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727737 PENDING ��͜�
>>> psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5'}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])
7727738 PENDING /correct/work/dir

For a failing job:

>>> bytes(job['work_dir'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f'
>>> bytes(job['std_out'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f/slurm-7736231.out'
>>> bytes(job['std_err'], 'utf-8')
b'\xef\xbf\xbd\xef\xbf\xbd/\xef\xbf\xbd\xef\xbf\xbd\x7f/slurm-7736231.out'
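Affected jobs can be spotted programmatically by looking for U+FFFD (the character behind the EF BF BD bytes) in the returned path fields. A minimal sketch, assuming the job dict uses the keys shown above; the helper name corrupted_fields is made up for illustration and is not part of PySlurm:

```python
REPLACEMENT_CHAR = "\ufffd"  # encodes to b'\xef\xbf\xbd' in UTF-8


def corrupted_fields(job, keys=("work_dir", "std_out", "std_err")):
    """Return the keys whose values contain the replacement character.

    `job` is assumed to be a dict like the one returned by find_id();
    this helper is hypothetical and only inspects the given keys.
    """
    return [key for key in keys if REPLACEMENT_CHAR in str(job.get(key, ""))]
```

An empty list means none of the checked paths contain replacement characters; it does not prove the paths are otherwise correct.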

Any idea what is going wrong?

@Sideboard Sideboard changed the title from "Submitting batch jobs fails randomly with broken paths" to "Submitting batch jobs fail randomly with broken paths" on Feb 7, 2023
@Sideboard Sideboard changed the title from "Submitting batch jobs fail randomly with broken paths" to "Submitting batch job fails randomly with broken paths" on Feb 7, 2023
@tazend
Member

tazend commented Feb 7, 2023

Hi,

could you try out the slurm-20.11.8 branch? (https://github.com/PySlurm/pyslurm/tree/slurm-20.11.8)
It is some commits ahead of the tag, perhaps this can do the trick.

So you submitted the jobs from the directory /correct/work/dir, right? Does scontrol show job <id> show the correct paths, or is it also broken there?
I'll also try to reproduce it on my side.

@Sideboard
Author

The strings are also broken in sacct and scontrol.

$ scontrol show job 7750363
⋮
   WorkDir=���*�
   StdErr=���*�/slurm-7750363.out
   StdIn=/dev/null
   StdOut=���*�/slurm-7750363.out
⋮

@Sideboard
Author

I switched to branch slurm-20.11.8 but that did not help. Could it be a mismatch with C string lengths? How can I debug this?

submit_batch_job() also ignores the option 'time': '00:02:00'. It uses the default time instead. Could there be a connection?

pyslurm.job().submit_batch_job({
    'wrap': 'sleep 60',
    'time': '00:02:00',
})

@mcsloy

mcsloy commented Feb 8, 2023

Looking at the byte data, you can see the recurring byte pattern EF BF BD. This is the UTF-8 encoding of the special character U+FFFD REPLACEMENT CHARACTER, see this link for more info. This character is used during encoding and decoding to replace erroneous data. For example, FF FF FF is not a valid UTF-8 byte sequence and will thus be replaced with EF BF BD during an encoding/decoding attempt. No error is raised during execution because PySlurm uses replacement-based error handling, i.e. .encode("UTF-8", "replace"). While these characters emerge during the encoding/decoding process, the non-deterministic nature of the error suggests that the encoding issue is only a symptom. The most likely culprit is either an overflow error or the memory location being freed before the code is actually done with the data stored there.

I suspect this issue is localised to the work_dir variable only. The std_out and std_err errors are likely due to those paths being created by appending other strings to the erroneous work_dir.
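To illustrate the replacement behaviour, here is a small sketch in plain Python. The raw input bytes are hypothetical (the real garbage in memory is unknown); they were chosen only to show how invalid bytes turn into the EF BF BD pattern reported above:

```python
# Hypothetical garbage bytes; 0xFF can never appear in valid UTF-8.
raw = b"\xff\xff/\xff\xff\x7f"

# Decoding with "replace" substitutes U+FFFD for each invalid byte
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "replace")

# Re-encoding the result yields EF BF BD for every replacement character.
round_trip = text.encode("utf-8")
print(round_trip)
```

With strict error handling (the default), the same decode call would raise UnicodeDecodeError, which would have made the underlying memory problem visible earlier.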

@tazend
Member

tazend commented Feb 8, 2023

Hi,

yeah, the culprit is definitely work_dir, with std_err and std_out only being side effects, since slurmctld puts the logs into the work dir by default.

The encoding step itself should be fine; however, it likely has to do with the lifetime of the char* pointer for work_dir, since this is done in the code (if no work dir has been specified and we just use the current working directory as default, as sbatch does):

cwd = os.getcwd().encode("UTF-8", "replace")
desc.work_dir = cwd

This itself is fine; however, this code is in a different function than the one actually submitting the job. By the time the function containing this code (fill_job_desc_from_opts) is done, desc.work_dir is basically undefined, because the lifetime of desc.work_dir is tied to cwd (a Python object), which has no more references, so Python may free the memory at any time. Hence the random failures: sometimes the data survives long enough, sometimes it doesn't.

You won't see this behaviour though when you explicitly specify the work_dir, since the Python object will live long enough because it is in the dict you pass when executing submit_batch_job.

Anyway, in this case a quick fix in the code would be to modify the incoming dict with the user options and manually insert the work_dir if nothing is specified, setting it to the output of getcwd and using this for encoding, as it will live long enough. Not really the nicest, since we manipulate the incoming input, but it will suffice in this case. I can make a fix for it.

The long-term fix would be to restructure the job API in a way that things like these can't happen anymore (working on it).
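The general remedy for this kind of lifetime problem is to let whatever owns the descriptor also own the encoded buffers. The following is only a sketch of that idea using ctypes to mimic a raw pointer field; RawDesc, DescOwner and their methods are hypothetical names, and the real PySlurm code is Cython, not ctypes:

```python
import ctypes
import os


class RawDesc(ctypes.Structure):
    """Hypothetical stand-in for the C job descriptor: holds only an address."""
    _fields_ = [("work_dir", ctypes.c_void_p)]


class DescOwner:
    """Owns the descriptor and every buffer its pointers refer to."""

    def __init__(self):
        self.desc = RawDesc()
        self._buffers = []  # keeps the buffers alive as long as this object lives

    def set_work_dir(self, path=None):
        source = path if path is not None else os.getcwd()
        encoded = source.encode("UTF-8", "replace")
        buf = ctypes.create_string_buffer(encoded)  # NUL-terminated copy
        self._buffers.append(buf)                   # tie lifetime to the owner
        self.desc.work_dir = ctypes.addressof(buf)  # raw address, no reference held

    def read_work_dir(self):
        # Safe only because self._buffers still references the buffer.
        return ctypes.string_at(self.desc.work_dir)
```

Without the `_buffers` list, the address stored in `desc.work_dir` would outlive the buffer it points to, which is the situation described above.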

The time option from sbatch is actually called time_limit in pyslurm.
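For callers who prefer sbatch-style names, a thin translation step before calling submit_batch_job is one option. A minimal sketch: normalize_job_opts is a hypothetical helper, the only mapping it knows is the one confirmed in this thread, and the value format accepted by time_limit is an assumption:

```python
def normalize_job_opts(opts):
    """Return a copy of opts with sbatch-style names mapped to pyslurm names."""
    renames = {"time": "time_limit"}  # only mapping confirmed in this thread
    return {renames.get(key, key): value for key, value in opts.items()}


# Assumed usage; the '00:02:00' value format is not verified here.
job_opts = normalize_job_opts({"wrap": "sleep 60", "time": "00:02:00"})
```

The resulting dict could then be passed to submit_batch_job in place of the original options.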

@Sideboard
Author

Sideboard commented Feb 9, 2023

The problem still persists if work_dir is included in the job options:

Submitted job 7762983 with {'wrap': 'sleep 5', 'work_dir': '/my/work/dir', 'get_user_env_time': -1}
7762983 | PENDING | 2880 | ��+���

So even the job_opts dict is garbage collected? Or at least in the context of a flask application. ⏳
No, it makes sense. Since in both cases a new object is created through encode() and mapped to desc.work_dir within fill_job_desc_from_opts().
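This can be checked in plain Python: the dict keeps the str alive, but every encode() call returns a fresh bytes object that nothing else references. A small sketch (the variable names are illustrative only):

```python
job_opts = {"wrap": "sleep 5", "work_dir": "/my/work/dir"}

# Each call builds a new bytes object; the dict only holds the original str.
first = job_opts["work_dir"].encode("UTF-8", "replace")
second = job_opts["work_dir"].encode("UTF-8", "replace")

print(first == second)  # same contents
print(first is second)  # but two distinct objects
```

So a pointer taken from such a temporary is only valid while some Python variable still refers to that particular bytes object, regardless of how long job_opts lives.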

@Sideboard
Author

The time option from sbatch is actually called time_limit in pyslurm.

Oh, time_limit, thanks. I thought it was time as with sbatch --time since the docstring for submit_batch_job says:

Submit batch job.
* make sure options match sbatch command line opts and not struct member names.

@tazend
Member

tazend commented Feb 9, 2023

Mh weird,

I can replicate the erroneous symbols if I don't supply a work_dir, however explicitly setting it works for me:

import pyslurm; psj = pyslurm.job() ; jid = psj.submit_batch_job({'wrap': 'sleep 5', 'work_dir': '/my/work/dir', 'get_user_env_time': -1}) ; job = psj.find_id(jid)[0] ; print(jid, job['job_state'], job['work_dir'])

mh, wondering why it's not working for you with that. (I'm on 22.05, though it is still the same code in pyslurm.)

@tazend
Member

tazend commented Dec 12, 2024

Starting from Slurm 21.8, the job submission API has been heavily reworked, and such errors are fixed there. The old pyslurm.job class in pyslurm.pyx is no longer supported.

The new class for job submission is pyslurm.JobSubmitDescription. Documentation can be found here: https://pyslurm.github.io/23.2/reference/jobsubmitdescription/

Slurm 20.X versions are too old to justify the time investment needed to backport the new API. Since 20.X is an old version that SchedMD no longer supports anyway, a newer version of Slurm should be used.

@tazend tazend closed this as completed Dec 12, 2024