Fix cori outdir and add architecture argument #26

Merged: 7 commits, Feb 19, 2021

@rly (Collaborator) commented Feb 11, 2021

  1. Expand $CSCRATCH env variable on cori
  2. Allow setting of cori architecture
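The two changes listed above could look roughly like the following. This is a minimal sketch and not the code merged in this PR: the long option name `--arch`, the `--outdir` option, its default path, the `haswell` default, and the helper names `build_parser`, `resolve_outdir`, and `sbatch_constraint` are all assumptions made for illustration. Only the `-a` flag, the `--cori` flag, and the `$CSCRATCH` variable appear in this thread.

```python
import argparse
import os


def build_parser():
    # Hypothetical parser: only --cori and -a are taken from this thread.
    parser = argparse.ArgumentParser(prog="train-job-sketch")
    parser.add_argument("--cori", action="store_true",
                        help="generate a job script for Cori")
    parser.add_argument("-a", "--arch", choices=["haswell", "knl"],
                        default="haswell",
                        help="node type, emitted as '#SBATCH -C <arch>'")
    # Assumed option and default; the real tool may choose its outdir differently.
    parser.add_argument("--outdir", default="$CSCRATCH/exabiome/deep-index",
                        help="output directory; environment variables are expanded")
    return parser


def resolve_outdir(path):
    # os.path.expandvars leaves unknown variables untouched,
    # so fail loudly rather than write to a literal '$CSCRATCH' directory.
    expanded = os.path.expandvars(path)
    if "$" in expanded:
        raise ValueError(f"unexpanded environment variable in {path!r}")
    return expanded


def sbatch_constraint(arch):
    # Slurm directive that selects the node architecture.
    return f"#SBATCH -C {arch}"
```

With `CSCRATCH` set in the environment, `resolve_outdir(args.outdir)` returns an absolute path that can be written into the `#SBATCH -o` and `OUTDIR` lines of the generated script.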

@rly requested a review from @ajtritt on February 11, 2021, 06:39
@rly (Collaborator, Author) commented Feb 11, 2021

With these changes, I can run:

deep-index train-job --cori -P m3513 -a haswell -t 00:15:00 --debug ar122_r95.genomic.medium.deep_index.input.h5 test.sh

However, I get this error:

more train.39070106.log
srun: fatal: Can not execute deep-index

Any ideas?

Never mind. Solved that.

@rly (Collaborator, Author) commented Feb 11, 2021

Now I am getting an h5py error:

['--slurm', '-d', '-M', '-b', '64', '-g', '4', '-n', '1', '-o', '256', '-W', '4000', '-S', '4000', '-r', '0.001', '-A', '1', '-e', '10', '-s', '1101233524', '-E', 'n1_g4_A1_b64_r0.001_o256', 'roznet', '/global/u1/r/rly/ar122_r95.genomic.medium.deep_index.input.h5', '/global/cscratch1/sd/rly/exabiome/deep-index/train/datasets/default/chunks_W4000_S4000/roznet/M/n1_g4_A1_b64_r0.001_o256/train.39070692']
Traceback (most recent call last):
  ...
  File "/global/cscratch1/sd/rly/env/deeptaxon/lib/python3.8/site-packages/hdmf/backends/hdf5/h5tools.py", line 684, in open
    self.__file = File(self.source, open_flag, **kwargs)
  File "/global/cscratch1/sd/rly/env/deeptaxon/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/files.py", line 406, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/global/cscratch1/sd/rly/env/deeptaxon/lib/python3.8/site-packages/h5py-2.10.0-py3.8-linux-x86_64.egg/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 524, error message = 'Unknown error 524')
srun: error: nid01913: task 0: Exited with exit code 1
srun: Terminating job step 39070692.0

Any ideas? It seems like a configuration error. I tried adding module load cray-hdf5 without success.

Here is my test.sh:

#!/bin/bash
#SBATCH -q debug
#SBATCH -A m3513
#SBATCH -t 00:15:00
#SBATCH -n 1
#SBATCH -o /global/cscratch1/sd/rly/exabiome/deep-index/train/datasets/default/chunks_W4000_S4000/roznet/M/n1_g4_A1_b64_r0.001_o256/train.%j.lsf_log
#SBATCH -e /global/cscratch1/sd/rly/exabiome/deep-index/train/datasets/default/chunks_W4000_S4000/roznet/M/n1_g4_A1_b64_r0.001_o256/train.%j.lsf_log
#SBATCH -C haswell

conda activate /global/cscratch1/sd/rly/env/deeptaxon

module load cray-hdf5

JOB="$SLURM_JOB_ID"
OPTIONS="-d -M -b 64 -g 4 -n 1 -o 256 -W 4000 -S 4000 -r 0.001 -A 1 -e 10 -s 1101233524 -E n1_g4_A1_b64_r0.001_o256"
OUTDIR="/global/cscratch1/sd/rly/exabiome/deep-index/train/datasets/default/chunks_W4000_S4000/roznet/M/n1_g4_A1_b64_r0.001_o256/train.$JOB"
INPUT="/global/u1/r/rly/ar122_r95.genomic.medium.deep_index.input.h5"
LOG="$OUTDIR.log"
CMD="deep-index train --slurm $OPTIONS roznet $INPUT $OUTDIR"

cp $0 $OUTDIR.sh
mkdir -p $OUTDIR
srun $CMD > $LOG 2>&1

@ajtritt (Collaborator) commented Feb 11, 2021

I'm not sure what that is. Can you try moving the input file to CSCRATCH?
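The suggestion above amounts to staging the input on scratch before opening it. Here is a minimal sketch, assuming the error comes from a filesystem that does not support file locking; the thread does not confirm this. The helper names `stage_input` and `disable_hdf5_file_locking` are hypothetical. `HDF5_USE_FILE_LOCKING` is an environment variable read by recent HDF5 1.10 releases, and turning locking off is a separate, unverified alternative to moving the file.

```python
import os
import shutil


def stage_input(input_path, scratch_dir):
    """Copy input_path into scratch_dir unless it is already there; return the staged path."""
    os.makedirs(scratch_dir, exist_ok=True)
    staged = os.path.join(scratch_dir, os.path.basename(input_path))
    if not os.path.exists(staged):
        shutil.copy2(input_path, staged)
    return staged


def disable_hdf5_file_locking():
    # Alternative workaround (an assumption, not confirmed in this thread).
    # Call this before h5py is imported so the HDF5 library sees the setting.
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
```

In the job script this would mean pointing `INPUT` at the staged copy under `$CSCRATCH` instead of the path under `/global/u1`.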

@ajtritt merged commit d7f54dd into master on Feb 19, 2021
@ajtritt deleted the fix/cori_paths branch on February 28, 2022, 18:38