
cns_canu using more memory than requested in slurm #1750

Closed
hyphaltip opened this issue Jun 25, 2020 · 4 comments

hyphaltip (Contributor) commented Jun 25, 2020

My unitig consensus jobs are using more memory than was requested in the SLURM job, so the jobs are getting killed. How can I specify a larger memory size for these cns_canu jobs running utgcns?

The jobs are being allocated roughly 800 MB–1 GB, but I think they need about 10x that to run properly.

Command line:
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m useGrid=true gridOptions="-p batch" minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz
Version: Canu 2.0

Linux, Linux version 3.10.0-957.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018
CentOS

From logfile in: unitigging/5-consensus

Found perl:
   /opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/bin/perl
   This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-linux-thread-multi

Found java:
   /opt/linux/centos/7.x/x86_64/pkgs/java/jdk1.8.0_45/bin/java
   java version "1.8.0_45"

Found canu:
   /bigdata/operations/pkgadmin/opt/linux/centos/7.x/x86_64/pkgs/canu/2.0/Linux-amd64/bin/canu
   Canu 2.0

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.
-- Using seqFile '../canu2_6FC.loredac.ctgStore/partition.0001'.
-- Opening tigStore '../canu2_6FC.loredac.ctgStore' version 1.
-- Opening output results file './ctgcns/0001.cns.WORKING'.
--
-- Computing consensus for b=0 to e=848692 with errorRate 0.2000 (max 0.4000) and minimum overlap 40
--
Loading corrected-trimmed reads from seqFile '../canu2_6FC.loredac.ctgStore/partition.0001'
/var/spool/slurmd/job1614851/slurm_script: line 103: 37404 Killed                  $bin/utgcns -R ../canu2_6FC.loredac.${tag}Store/partition.$jobid -T ../canu2_6FC.loredac.${tag}Store 1 -P $jobid -O ./${tag}cns/$jobid.cns.WORKING -maxcoverage 40 -e 0.2 -pbdagcon -edlib -threads 8
slurmstepd-c26: error: Detected 1 oom-kill event(s) in step 1614851.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

skoren (Member) commented Jun 25, 2020

Canu does retry after increasing the memory over the initial request. However, it won't increase it 10-fold, and it would be quite strange for the memory estimate to be off by that much. What's the actual memory request Canu is making for these jobs to the grid (in the consensus.jobSubmit-01.sh script)?
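
For reference, the exact request can be checked in the submit script Canu wrote; a minimal sketch, assuming the -d directory from the command line above and the unitigging/5-consensus stage directory mentioned in the log:

# Show the sbatch options Canu generated for the consensus jobs
# (path assumes the -d directory used in the original canu command).
cat canu2_6FC.loredac_corrected/unitigging/5-consensus/consensus.jobSubmit-01.sh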

hyphaltip (Contributor, Author) commented Jun 27, 2020

These are the values generated by Canu, and they seem like they should have been enough, so I'm not sure what went wrong. The request was --cpus-per-task=8 --mem-per-cpu=804m, but I suspect mem-per-cpu wasn't being expanded to 8 x 804m, as (AFAIK) it should have been.

#!/bin/sh

sbatch \
  --cpus-per-task=8 --mem-per-cpu=804m -p batch --mem-per-cpu=16gb -o consensus.%A_%a.out \
  -D `pwd` -J "cns_canu2_6FC.loredac" \
  -a 1-2 \
  `pwd`/consensus.sh 0 \
> ./consensus.jobSubmit-01.out 2>&1

I added --mem-per-cpu=16gb via my gridOptions, and since it was tacked on after the generated value the jobs succeeded, but maybe our SLURM config is not handling the original request correctly? The effective options were:
--cpus-per-task=8 --mem-per-cpu=804m -p batch --mem-per-cpu=16gb
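
One way to double-check which of the duplicated --mem-per-cpu values SLURM actually honored is to query the accounting records; a minimal sketch, assuming SLURM accounting is enabled and using the job ID from the oom-kill message above:

# Requested memory vs. peak usage for the killed array task
# (1614851 is the job ID from the slurmstepd log above; substitute your own job ID).
sacct -j 1614851 --format=JobID,ReqMem,MaxRSS,State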

I've run earlier Canu versions on this same dataset and never had this memory issue. When I logged into a machine running one of these jobs, it was sitting at a 7-8 GB footprint before it was killed. So I'm not sure what should be tweaked to avoid needing this blanket large memory request.

skoren (Member) commented Jun 28, 2020

If it was asking for 800 MB per core and 8 cores, that would put it around 7-8 GB, so perhaps it was just under-requesting the memory. There is a retry which increases the memory in case the first attempt fails. You can pass --mem-per-cpu=2g via gridOptionsCns, or you can try specifying minMemory=16, which should also keep all jobs at 16 GB or larger.
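
As a sketch of what either suggestion looks like applied to the original command line (option values taken from the suggestions above; use one or the other, not both):

# Option 1: raise the per-CPU memory request for the consensus jobs only
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m \
  useGrid=true gridOptions="-p batch" gridOptionsCns="--mem-per-cpu=2g" \
  minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz

# Option 2: enforce a 16 GB minimum on every job's memory request
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m \
  useGrid=true gridOptions="-p batch" minMemory=16 \
  minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz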

brianwalenz added a commit that referenced this issue Jul 9, 2020

brianwalenz (Member) commented

Fixed a possible cause of this problem.
