too many open files #118

RichardCorbett · 2015-02-06T16:39:42Z

Hi.
I've been trying to merge and dup mark a large data set. Once merged, the coverage will be about 120 X.

Each time I try, I have been getting the following error (or something similar):
"sambamba-markdup: sambamba_testing/sambamba-pid23155-nwfz/sorted.509.bam.vo: Too many open files"

I noticed in the help that I could reduce the number of open files by specifying a larger value for "--overflow-list-size". However, I still get the same error.

Here is the command I've been using - can you point out anything I can change to get past the error?
"sambamba_v0.5.1 merge /dev/stdout P*bam | sambamba_v0.5.1 markdup --overflow-list-size 1000000 --tmpdir sambamba_testing /dev/stdin sambamba_marked.bam

finding positions of the duplicate reads in the file...
sambamba-markdup: sambamba_testing/sambamba-pid23155-nwfz/sorted.509.bam.vo: Too many open files
"

thanks,
Richard

lomereiter · 2015-02-06T17:01:43Z

Hi Richard,

The large number (509) indicates that many paired reads are extremely far apart in the file (or there are no mates at all, e.g. all reads have the same direction or not properly named).
Please check that the paired reads don't have suffixes like '/1'.

RichardCorbett · 2015-02-06T17:24:14Z

Hi,
Thanks for the quick reply. The reads don't have any weird suffixes (no /1, etc.) and we can merge/dupmark with Picard without any problems (just slowly).

Since this is a cancer sample, there are lots of regions with very high coverage. The average coverage is just over 100, but there are lots of regions higher than 1000X. In total there are over 3 billion reads. The alignment rate is above 95% and over 98% of the aligned reads are mapped in proper pairs with a mean insert size of 415bp. By our standards, these are pretty good stats for a human genome library.

Any ideas?

lomereiter · 2015-02-06T17:38:43Z

Try to increase --hash-table-size as well, up to 1000000 or even 10 million if there's enough RAM (assuming a record occupies 1kb, it's 1 to 10 GB in total). That should help with the high-coverage regions.
(By the way, do you, by any chance, have a rough figure of Picard memory consumption on this dataset?)

RichardCorbett · 2015-02-06T18:00:36Z

When we ran Picard we capped the RAM usage at 25Gigs. It used all of that and likely would have used a lot more had we allowed.

Using the sambamba pipe like I showed above I've seen each process use 50Gigs. If we move this to our production pipe, we'll need to make it so we can merge and mark dups 60Gigs total. Do you think that is possible?

lomereiter · 2015-02-06T18:13:33Z

Even 30Gigs total should be possible. I've fixed a few leaks recently (#116), so top consumption of the latest binary build should be significantly lower.

RichardCorbett · 2015-02-06T19:30:27Z

thanks, I'll give it a whirl and let you know what I find.

RichardCorbett · 2015-02-10T16:07:08Z

Hi again,

Looks like I got farther this time, but ended up with a different error:

"time ./sambamba_02_02_2015 merge /dev/stdout P*bam | ./sambamba_02_02_2015 markdup --overflow-list-size 1000000 --hash-table-size 1000000 --tmpdir sambamba_testing /dev/stdin sambamba_marked.bam
finding positions of the duplicate reads in the file...
sorting 1377861776 end pairs... done in 301371 ms
sorting 23102470 single ends (among them 8026 unmatched pairs)... done in 2543 ms
collecting virtual offsets of duplicate reads... done in 141731 ms
found 82954291 duplicates, sorting the list... done in 4386 ms
collected list of positions in 863 min 26 sec
sambamba-markdup: Error reading BGZF block starting from offset 0: wrong BGZF magic
"

lomereiter · 2015-02-10T16:18:04Z

Ouch. The streaming input is not supported by this tool, it makes a list of file offsets and then reads the file again. Sorry for wasting 15h of computational time. I'm closing this issue and opening another one, regarding the documentation.

lomereiter closed this as completed Feb 10, 2015

lomereiter mentioned this issue Oct 13, 2015

MarkDup on Merge error #169

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

too many open files #118

too many open files #118

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

RichardCorbett commented Feb 10, 2015

lomereiter commented Feb 10, 2015

too many open files #118

too many open files #118

Comments

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

lomereiter commented Feb 6, 2015

RichardCorbett commented Feb 6, 2015

RichardCorbett commented Feb 10, 2015

lomereiter commented Feb 10, 2015