
cns_canu using more memory than requested in slurm #1750

Closed
hyphaltip opened this issue Jun 25, 2020 · 4 comments

hyphaltip (Contributor) commented Jun 25, 2020

My unitig consensus jobs are using more memory than was requested in the SLURM job, so the jobs are getting killed. How can I specify a larger memory size for these cns_canu jobs running utgcns?

The jobs are being allocated roughly 800 MB–1 GB, but I think they need about 10x that to run properly.

Command line:
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m useGrid=true gridOptions="-p batch" minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz
Version: Canu 2.0

Linux, Linux version 3.10.0-957.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018
CentOS

From logfile in: unitigging/5-consensus

Found perl:
   /opt/linux/centos/7.x/x86_64/pkgs/miniconda3/4.3.31/bin/perl
   This is perl 5, version 26, subversion 2 (v5.26.2) built for x86_64-linux-thread-multi

Found java:
   /opt/linux/centos/7.x/x86_64/pkgs/java/jdk1.8.0_45/bin/java
   java version "1.8.0_45"

Found canu:
   /bigdata/operations/pkgadmin/opt/linux/centos/7.x/x86_64/pkgs/canu/2.0/Linux-amd64/bin/canu
   Canu 2.0

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.
-- Using seqFile '../canu2_6FC.loredac.ctgStore/partition.0001'.
-- Opening tigStore '../canu2_6FC.loredac.ctgStore' version 1.
-- Opening output results file './ctgcns/0001.cns.WORKING'.
--
-- Computing consensus for b=0 to e=848692 with errorRate 0.2000 (max 0.4000) and minimum overlap 40
--
Loading corrected-trimmed reads from seqFile '../canu2_6FC.loredac.ctgStore/partition.0001'
/var/spool/slurmd/job1614851/slurm_script: line 103: 37404 Killed                  $bin/utgcns -R ../canu2_6FC.loredac.${tag}Store/partition.$jobid -T ../canu2_6FC.loredac.${tag}Store 1 -P $jobid -O ./${tag}cns/$jobid.cns.WORKING -maxcoverage 40 -e 0.2 -pbdagcon -edlib -threads 8
slurmstepd-c26: error: Detected 1 oom-kill event(s) in step 1614851.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

skoren (Member) commented Jun 25, 2020

Canu does retry after increasing the memory over the initial request. However, it won't increase it 10-fold, and it would be quite strange for the memory estimate to be off by that much. What's the actual memory request Canu is making for these jobs to the grid (in the consensus.jobSubmit-01.sh script)?
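
For reference, the exact request can be checked in the submit script Canu wrote; a minimal sketch, assuming the -d directory from the command line above and the unitigging/5-consensus stage directory mentioned in the log:

# Show the sbatch options Canu generated for the consensus jobs
# (path assumes the -d directory used in the original canu command).
cat canu2_6FC.loredac_corrected/unitigging/5-consensus/consensus.jobSubmit-01.sh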

hyphaltip (Contributor, Author) commented Jun 27, 2020

These are the values generated by Canu, and they seem like they should have been enough, so I'm not sure what went wrong. The request was --cpus-per-task=8 --mem-per-cpu=804m, but I suspect mem-per-cpu wasn't being expanded to 8 x 804m, as (AFAIK) it should have been.

#!/bin/sh

sbatch \
  --cpus-per-task=8 --mem-per-cpu=804m -p batch --mem-per-cpu=16gb -o consensus.%A_%a.out \
  -D `pwd` -J "cns_canu2_6FC.loredac" \
  -a 1-2 \
  `pwd`/consensus.sh 0 \
> ./consensus.jobSubmit-01.out 2>&1

I added --mem-per-cpu=16gb via my gridOptions, and since it was tacked on after the generated value the jobs succeeded, but maybe our SLURM config is not handling the original request correctly? The effective options were:
--cpus-per-task=8 --mem-per-cpu=804m -p batch --mem-per-cpu=16gb
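
One way to double-check which of the duplicated --mem-per-cpu values SLURM actually honored is to query the accounting records; a minimal sketch, assuming SLURM accounting is enabled and using the job ID from the oom-kill message above:

# Requested memory vs. peak usage for the killed array task
# (1614851 is the job ID from the slurmstepd log above; substitute your own job ID).
sacct -j 1614851 --format=JobID,ReqMem,MaxRSS,State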

I've run earlier Canu versions on this same dataset and never had this memory issue. When I logged into a machine running one of these jobs, it was sitting at a 7-8 GB footprint before it was killed. So I'm not sure what should be tweaked to avoid needing this blanket large memory request.

skoren (Member) commented Jun 28, 2020

If it was asking for 800 MB per core and 8 cores, that would put it around 7-8 GB, so perhaps it was just under-requesting the memory. There is a retry which increases the memory in case the first attempt fails. You can pass --mem-per-cpu=2g via gridOptionsCns, or you can try specifying minMemory=16, which should also keep all jobs at 16 GB or larger.
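
As a sketch of what either suggestion looks like applied to the original command line (option values taken from the suggestions above; use one or the other, not both):

# Option 1: raise the per-CPU memory request for the consensus jobs only
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m \
  useGrid=true gridOptions="-p batch" gridOptionsCns="--mem-per-cpu=2g" \
  minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz

# Option 2: enforce a 16 GB minimum on every job's memory request
canu -d canu2_6FC.loredac_corrected -p canu2_6FC.loredac genomeSize=900m \
  useGrid=true gridOptions="-p batch" minMemory=16 \
  minReadLength=750 -corrected -nanopore 6FC.corrected_loredac.fasta.gz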

brianwalenz added a commit that referenced this issue Jul 9, 2020

brianwalenz (Member) commented

Fixed a possible cause of this problem.
