--work-on-disk skips steps #97

Closed
nick-youngblut opened this issue Jun 8, 2022 · 15 comments

@nick-youngblut

krakenuniq-build died due to an out-of-memory error:

Found jellyfish v1.1.12
Kraken build set to minimize disk writes.
Finding all library files
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '31172786716'
K-mer set created. [2h55m52.083s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 31067325971 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 31067325971 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [7h34m38.290s]
Creating seqID to taxID map (step 4 of 6)..
1219382 sequences mapped to taxa. [52.317s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 401815 taxa
taxDB construction finished. [2.846s]
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (347.204 GB) ... Done
Loaded database with 31067325971 keys with k of 31 [val_len 4, key_len 8].
Reading sequence ID to taxonomy ID mapping ...  got 1219382 mappings.
Processed 681 s%

I then tried running krakenuniq-build --work-on-disk, and the job took ~5 seconds:

Kraken build set to minimize RAM usage.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Creating taxDB (step 5 of 6)...
taxDB construction finished. [2.846s]
Building  KrakenUniq LCA database (step 6 of 6)...

...however, the job never generated the database.kdb output file. If I instead omit --work-on-disk, krakenuniq-build does appear to start producing the database.kdb output:

Found jellyfish v1.1.12
Kraken build set to minimize disk writes.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
Getting database0.kdb into memory (347.204 GB) ...

I'm using krakenuniq=0.6 due to #95
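
For anyone sizing a machine for a similar build, the memory figure in the step 6 log line can be reproduced from the key count. This is a minimal sketch, assuming the in-memory size is simply the number of keys times (key_len + val_len) bytes with no header overhead, and that "GB" in the log means 2^30 bytes. Both assumptions are inferred from the numbers in the log, not from the KrakenUniq source, and the function name is made up for illustration.

```python
def estimate_kdb_gib(n_keys, key_len=8, val_len=4):
    """Rough in-memory size of a k-mer database in GiB.

    Assumption: size = n_keys * (key_len + val_len) bytes, no header overhead.
    """
    return n_keys * (key_len + val_len) / 2**30

# Key count, key_len and val_len are taken from the build log above
print(f"{estimate_kdb_gib(31067325971):.3f} GiB")
```

Under these assumptions the estimate comes out at 347.204, which agrees with the figure in the "Getting database0.kdb into memory" line, so step 6 needs roughly that much RAM when run without --work-on-disk.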

@nick-youngblut
Author

As a test of reproducibility, I killed the krakenuniq-build job at the end of the above post (Getting database0.kdb into memory (347.204 GB) ...), and then ran krakenuniq-build --work-on-disk again to check that it would generate the same output as before:

Kraken build set to minimize RAM usage.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Creating taxDB (step 5 of 6)...
taxDB construction finished. [2.846s]
Building  KrakenUniq LCA database (step 6 of 6)...

...however, krakenuniq-build --work-on-disk instead produced the following output:

Found jellyfish v1.1.12
Kraken build set to minimize RAM usage.
Found 10000 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
You need to operate in RAM (flag -M) to use output to a different file (flag -o)
xargs: cat: terminated by signal 13
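
When comparing runs like the ones above, it can help to reduce each log to a per-step status. The following is a rough sketch that relies only on the two line patterns visible in these logs ("Skipping step N" and "(step N of 6)"); the function name is hypothetical and nothing here is part of KrakenUniq itself.

```python
import re

def summarize_steps(log_text):
    """Map step number -> 'skipped' or 'started' from krakenuniq-build log text."""
    status = {}
    for line in log_text.splitlines():
        skipped = re.match(r'Skipping step (\d+)', line)
        started = re.search(r'\(step (\d+) of \d+\)', line)
        if skipped:
            status[int(skipped.group(1))] = 'skipped'
        elif started:
            status[int(started.group(1))] = 'started'
    return status

# Log lines copied from the failing --work-on-disk run above
log = """Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
"""
print(summarize_steps(log))
```

Note that 'started' only means the step was announced in the log. As the run above shows, a step can start and still fail, so the output files need to be checked separately.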

@nick-youngblut
Author

I get the same error with krakenuniq=0.6 when starting a krakenuniq-build job on a new library (using --work-on-disk):

Found jellyfish v1.1.12
Kraken build set to minimize RAM usage.
Finding all library files
Found 500 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using jellyfish
Hash size not specified, using '1637986465'
K-mer set created. [8m10.272s]
Skipping step 2, no database reduction requested.
Sorting k-mer set (step 3 of 6)...
db_sort: Getting database into memory ...Loaded database with 1623560677 keys with k of 31 [val_len 4, key_len 8].
Loaded database with 1623560677 keys with k of 31 [val_len 4, key_len 8].
db_sort: Sorting ...db_sort: Sorting complete - writing database to disk ...
K-mer set sorted. [22m46.406s]
Creating seqID to taxID map (step 4 of 6)..
61039 sequences mapped to taxa. [3.395s]
Creating taxDB (step 5 of 6)...
Building taxonomy index from taxonomy//nodes.dmp and taxonomy//names.dmp. Done, got 401815 taxa
taxDB construction finished. [3.468s]
Building  KrakenUniq LCA database (step 6 of 6)...
Reading taxonomy index from taxDB. Done.
You need to operate in RAM (flag -M) to use output to a different file (flag -o)
xargs: cat: terminated by signal 13

@alekseyzimin
Collaborator

Please check the --work-on-disk option in the latest release, v0.7.3; it should work properly now.

@nick-youngblut
Author

With v0.7.3, I'm still getting the error described at #52. My build directory includes:

database-build.log
database.jdb
database0.kdb
database_0
database_1
library/
library-files.txt
seqid2taxid-plus.map
seqid2taxid.map
taxDB
taxonomy/

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@nick-youngblut
Author

A simple ./krakenuniq-build --kmer-len 31 --build --threads 12 --db $DB, with $DB denoting the database base directory path.

Using --rebuild does not help (just checked again)

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@nick-youngblut
Author

nick-youngblut commented Jun 23, 2022

I tried krakenuniq-build --db . --threads 32 --work-on-disk in the appropriate directory, but I still got the same error.

Maybe it's due to how I'm adding genomes to the library? My simple helper script for that:

#!/usr/bin/env python
import os
import re
import gzip
import bz2
import argparse
import logging

# logging
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.DEBUG)

# argparse
class CustomFormatter(argparse.ArgumentDefaultsHelpFormatter,
                      argparse.RawDescriptionHelpFormatter):
    pass

desc = 'Adding genome to krakenuniq database'
epi = """DESCRIPTION:
Write output files to db_dir:
* renamed genome fasta (all special characters removed from names)
* krakenuniq map file
"""
parser = argparse.ArgumentParser(description=desc, epilog=epi,
                                 formatter_class=CustomFormatter)
parser.add_argument('fasta_file', type=str,
                    help='Input genome fasta file')
parser.add_argument('taxid', type=str,
                    help='Taxonomy ID for the genome')
parser.add_argument('sample', type=str,
                    help='Genome name')
parser.add_argument('db_dir', type=str,
                    help='Output database location (e.g., ku_db/library/)')
parser.add_argument('--version', action='version', version='0.0.1')

# functions
def _open(infile):
    """
    Open the input in text mode, regardless of compression
    """
    if infile.endswith('.bz2'):
        return bz2.open(infile, 'rt')
    elif infile.endswith('.gz'):
        return gzip.open(infile, 'rt')
    else:
        return open(infile)

def copy_genome(infile, outdir, sample):
    """
    Copy the genome to outdir with sanitized headers; return the contig IDs
    """
    outfile = os.path.join(outdir, sample + '.fna')
    # matches every character other than '>', alphanumerics, '-' and newline
    regex = re.compile(r'[^>A-Za-z0-9-\n]')
    contigs = list()
    with _open(infile) as inF, open(outfile, 'w') as outF:
        for line in inF:
            # seq header
            if line.startswith('>'):
                line = regex.sub('_', line)
                contigs.append(line.lstrip('>').rstrip())
            # writing to output directory
            outF.write(line)
    logging.info(f'File written: {outfile}')
    # return
    return contigs

def write_map(contigs, outdir, sample, taxid):
    outfile = os.path.join(outdir, sample + '.map')
    with open(outfile, 'w') as outF:
        for contig in contigs:
            outF.write('\t'.join([contig, taxid, sample]) + '\n')
    logging.info(f'File written: {outfile}')

## main interface function
def main(args):
    if not os.path.isdir(args.db_dir):
        os.makedirs(args.db_dir)
    contigs = copy_genome(args.fasta_file, args.db_dir, args.sample)
    write_map(contigs, args.db_dir, args.sample, args.taxid)

## script main
if __name__ == '__main__':
    args = parser.parse_args()
    main(args)
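
To make the effect of the header clean-up concrete, here is a small sketch that applies the same regular expression to a single header and builds the matching map line. The header, the taxonomy ID and the genome name are hypothetical placeholders, and sanitize_header is a helper introduced only for this illustration.

```python
import re

# Same pattern as in the helper script above
regex = re.compile(r'[^>A-Za-z0-9-\n]')

def sanitize_header(line):
    """Replace every character other than '>', alphanumerics, '-' and newline with '_'."""
    return regex.sub('_', line)

# Hypothetical FASTA header, taxonomy ID and genome name
header = '>NZ_EXAMPLE.1 Example organism strain X-1\n'
clean = sanitize_header(header)
contig = clean.lstrip('>').rstrip()
map_line = '\t'.join([contig, '12345', 'example_genome'])

print(repr(clean))
print(repr(map_line))
```

Because spaces are replaced as well, the whole header, description included, collapses into a single token, so the sequence ID written to the .fna file and the one written to the .map file stay identical.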

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@nick-youngblut
Author

I tried creating a new krakenuniq library, and now I'm getting the following:

krakenuniq-build  --kmer-len 31  --build --threads 12           --db $DB
Kraken build set to minimize disk writes.
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Creating k-mer set (step 1 of 6)...
Using /tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish
Hash size not specified, using '32573424'
/tmp/global2/nyoungblut/code/dev/Struo2/bin/scripts/krakenuniq/jellyfish-install/bin/jellyfish: error while loading shared libraries: libjellyfish-1.1.so.1: cannot open shared object file: No such file or directory

I installed krakenuniq v0.7.3 via:

git clone https://github.com/fbreitwieser/krakenuniq
cd krakenuniq
./install_krakenuniq /PATH/TO/INSTALL_DIR

...since that version isn't on bioconda yet

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@nick-youngblut
Author

Yeah, the path was just messed up.

The run worked:

Kraken build set to minimize RAM usage.
Found 10 sequence files (*.{fna,fa,ffn,fasta,fsa}) in the library directory.
Skipping step 1, k-mer set already exists.
Skipping step 2, no database reduction requested.
Skipping step 3, k-mer set already sorted.
Skipping step 4, seqID to taxID map already complete.
Skipping step 5, taxDB exists.
Skipping step 6, LCAs already set.
Database construction complete. [Total: 0.014s]
You can delete all files but database.{kdb,idx} and taxDB now, if you want

...but the error set_lcas: unable to open database.idx: No such file or directory is generated if you try to rebuild the database after building (or attempting to build) it once
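
Since a build can report steps as skipped or complete while output files are missing, a quick check of the build directory may be useful before rebuilding. This is a minimal sketch, assuming the final files are the ones named in the "You can delete all files but database.{kdb,idx} and taxDB" log line; the function name is hypothetical and the temporary directory only stands in for a real $DB path.

```python
import os
import tempfile

# Final outputs, per the last line of the build log above
FINAL_FILES = ('database.kdb', 'database.idx', 'taxDB')

def missing_final_files(db_dir):
    """Return the final database files that are absent from db_dir."""
    return [f for f in FINAL_FILES
            if not os.path.exists(os.path.join(db_dir, f))]

# Demo on a throwaway directory that mimics an incomplete build
with tempfile.TemporaryDirectory() as db_dir:
    for name in ('database0.kdb', 'taxDB'):
        open(os.path.join(db_dir, name), 'w').close()
    missing = missing_final_files(db_dir)

print(missing)
```

In the demo only the intermediate database0.kdb and taxDB exist, so database.kdb and database.idx are reported as missing, which is the situation described in the first post of this thread.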

@alekseyzimin
Collaborator

alekseyzimin commented Jun 23, 2022 via email

@nick-youngblut
Author

Yep, that fixed the issue. Thanks @alekseyzimin for all of your help!
