Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example: Individual sketches for each sequence in the multiFasta file concurrently. #2537

Closed
mr-eyes opened this issue Mar 23, 2023 · 2 comments
Labels

Comments

@mr-eyes
Copy link
Member

mr-eyes commented Mar 23, 2023

import sys
import gzip
from Bio import SeqIO
import sourmash
from sourmash import MinHash
import concurrent.futures

def create_sourmash_sketch(genome, ksizes, scale):
    # Create a list to store MinHash objects for different k-mer sizes
    minhash_list = []

    # Create a MinHash object for each k-mer size
    for ksize in ksizes:
        mh = MinHash(n = 0, scaled=scale, ksize=ksize, track_abundance=False)
        mh.add_sequence(str(genome.seq).upper(), force=True)
        minhash_list.append(mh)

    sigs = [sourmash.SourmashSignature(mh, name=genome.id) for mh in minhash_list]
    with open(f"{genome.id}.sig", "w") as fSig:
        sourmash.save_signatures(sigs, fSig)
    

    return f"Generated sourmash sketch for genome: {genome.id}"

def create_sourmash_sketches(input_fna_gz, num_cores):
    # Read the input .fna.gz file
    # check if file is gzipped or not
    if input_fna_gz.endswith(".gz"):
        with gzip.open(input_fna_gz, "rt") as handle:
            genomes = list(SeqIO.parse(handle, "fasta"))
    else:
        with open(input_fna_gz, "rt") as handle:
            genomes = list(SeqIO.parse(handle, "fasta"))

    # Set the parameters for the MinHash sketch
    ksizes = [21, 31, 51]
    scale = 1000

    # Process genomes concurrently
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_cores) as executor:
        futures = [executor.submit(create_sourmash_sketch, genome, ksizes, scale) for genome in genomes]
        for future in concurrent.futures.as_completed(futures):
            print(future.result())


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python multifasta_sketch.py <input_fasta> <num_cores>")
        sys.exit(1)

    input_fna_gz = sys.argv[1]
    num_cores = int(sys.argv[2])
    create_sourmash_sketches(input_fna_gz, num_cores)
@ctb
Copy link
Contributor

ctb commented Mar 24, 2023

neat!

A few notes -

  • screed will handle gz, bz2, fasta, and fastq file formats automatically. No need to drag BioPython in!
  • you might also consider using sourmash_args.SaveSignaturesToLocation (see below) instead of sourmash.save_signatures, as it will support many different kinds of output.
  • there is an early stage plugin, sourmash_plugin_sketchall that does something similar! see pull requests in that repo for various options.
  • I found that multithreading was better than multiprocessing, here

using sourmash_args.SaveSignaturesToLocation

    with sourmash_args.SaveSignaturesToLocation(sigfile) as save_sig:
        for sigs in sigslist:
            set_sig_name(sigs, filename, name)
            for ss in sigs:
                save_sig.add(ss)

@ctb
Copy link
Contributor

ctb commented Sep 27, 2024

Closing in favor of sourmash_plugin_branchwater which provides manysketch ... --singleton and is very fast!

@ctb ctb closed this as completed Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants