Separate "index" and "search" into substeps for nextflow pipeline #10

Open
wants to merge 17 commits into
base: main


olgabot
Contributor

@olgabot olgabot commented Feb 25, 2025

Addresses #8 and #7. Having the commands wrapped is nice, but some of the steps currently take a long time (e.g. #9 has been running for ~7 days, boo). Separating them out will make writing Nextflow pipelines easier, because each step can then be its own Nextflow process in https://github.com/seanome/nf-core-kmerseek/


@olgabot
Contributor Author

olgabot commented Feb 28, 2025

Here's what the CLI looks like so far. Is it clear that index wraps all the index-* commands?

(kmerseek-dev) ~/code/kmerseek (olgabot/separate-index-search-substeps)
$ kmerseek --help
Usage: kmerseek [OPTIONS] COMMAND [ARGS]...

  Kmerseek performs efficient protein domain annotation search with reduced
  amino acid k-mers

Options:
  --help  Show this message and exit.

Commands:
  index                  Prepare a database index for searching with k-mers
  index-create-kmers-pq  Substep of index: Extract k-mer sequences and...
  index-create-rocksdb   Substep of index: Creates RocksDB index for fast...
  index-create-sketch    Substep of index: low memory, parallelized
  search                 Search for k-mers in target sequences.

My thoughts:

  1. "Substep of index" takes up a lot of characters and pushes the rest of the help text out of view, so it would be nice to shorten it.
  2. It'd be nice for the index-create-* substeps to be indented under index for clarity.

@olgabot
Contributor Author

olgabot commented Feb 28, 2025

Or maybe index-create-* could be subcommands of index and running index alone invokes them all. That way, kmerseek --help would only show index and search which feels cleaner to me.
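
A minimal sketch of that idea, not kmerseek's actual code: `index` accepts an optional substep, and omitting it means "run every substep in order". It uses the standard library's argparse only to stay dependency-free; the help output above looks like Click's, where a group created with `invoke_without_command=True` allows the same pattern. The step names and the `steps_to_run` helper are hypothetical.

```python
import argparse

# Hypothetical substep names, listed in the order they would run.
INDEX_STEPS = ["create-sketch", "create-kmers-pq", "create-rocksdb"]


def build_parser():
    parser = argparse.ArgumentParser(prog="kmerseek")
    commands = parser.add_subparsers(dest="command", required=True)

    index = commands.add_parser("index", help="Prepare a database index")
    # The nested subcommand is optional: leaving it out leaves `step` as None.
    steps = index.add_subparsers(dest="step")
    for name in INDEX_STEPS:
        steps.add_parser(name, help=f"Run only the {name} substep")

    commands.add_parser("search", help="Search for k-mers in target sequences")
    return parser


def steps_to_run(argv):
    """Return the index substeps that a given command line would execute."""
    args = build_parser().parse_args(argv)
    if args.command != "index":
        return []
    # Bare `index` runs everything; `index <step>` runs just that step.
    return list(INDEX_STEPS) if args.step is None else [args.step]


if __name__ == "__main__":
    print(steps_to_run(["index"]))
    print(steps_to_run(["index", "create-rocksdb"]))
```

With this layout the top-level help lists only `index` and `search`, while a Nextflow process can still call a single substep on its own.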

@olgabot
Contributor Author

olgabot commented Mar 5, 2025

Here's what the command line looks like so far:

(kmerseek-dev) ~/code/kmerseek (olgabot/separate-index-search-substeps)
$ kmerseek --help
Usage: kmerseek [OPTIONS] COMMAND [ARGS]...

  Kmerseek performs efficient protein domain annotation search with reduced
  amino acid k-mers

Options:
  --help  Show this message and exit.

Commands:
  index                           Prepare a database index for searching...
  index-01-create-sketch          Substep of Index: low memory,...
  index-02-create-kmers-pq        Substep of Index: Extract k-mer...
  index-03-create-rocksdb         Substep of Index: Creates RocksDB...
  search                          Search for k-mers in target sequences.
  search-01-create-query-sketch   Substep of Search: low memory,...
  search-02-create-query-kmers-pq
                                  Substep of Search: Extract k-mer...
  search-03-do-search             Substep of Search: Perform Sourmash...
  search-04-show-results          Substep of Search: Visualize the...

@heuermh
Contributor

heuermh commented Mar 14, 2025

Gave this some more thought.

Leaving out rocksdb for a sec, this is what I see

kmerseek

Commands:
  index  sketch target kmers, extract target kmers
  search  sketch source kmers, extract source kmers, find matches, and visualize matches

Then for kmerseek index

kmerseek index    FASTA --> kmer signature file and kmer parquet file(s)

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file(s)

and for kmerseek search

kmerseek search    query FASTA, target FASTA, target kmer signature file, target kmer parquet file(s) --> query kmer signature file, query kmer parquet file(s), matches CSV file, vis text file or stdout

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file(s)
  find-matches    query FASTA, query kmer signature file, target FASTA, target kmer signature file --> matches CSV file
  visualize-matches    query FASTA, query kmer signature file, query kmer parquet file(s), target FASTA, target kmer signature file, target kmer parquet file(s), matches CSV file --> vis text file, or stdout

Am I missing any input or output files here?
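
One way to check a listing like this mechanically is to write each subcommand's inputs and outputs down as sets and derive what the wrapping `search` command must take in and produce. The sketch below is only an illustration: the file labels are shorthand for the names in the listing, `wrapper_io` is a hypothetical helper, and it assumes that within `search` the sketch-kmers and extract-kmers steps act on the query FASTA.

```python
# Each substep maps to (inputs, outputs), in the order the steps would run.
# Assumption: inside `search`, sketch-kmers and extract-kmers process the query.
SEARCH_STEPS = {
    "sketch-kmers": ({"query FASTA"}, {"query signature"}),
    "extract-kmers": ({"query FASTA"}, {"query parquet"}),
    "find-matches": (
        {"query FASTA", "query signature", "target FASTA", "target signature"},
        {"matches CSV"},
    ),
    "visualize-matches": (
        {
            "query FASTA", "query signature", "query parquet",
            "target FASTA", "target signature", "target parquet",
            "matches CSV",
        },
        {"visualization"},
    ),
}


def wrapper_io(steps):
    """Return (external inputs, produced outputs) for steps run in order."""
    produced, external = set(), set()
    for inputs, outputs in steps.values():
        # Anything a step needs that no earlier step produced must be supplied.
        external |= inputs - produced
        produced |= outputs
    return external, produced


if __name__ == "__main__":
    needed, made = wrapper_io(SEARCH_STEPS)
    print("search needs:", sorted(needed))
    print("search produces:", sorted(made))
```

Under those assumptions, the derived inputs and outputs agree with the one-line summary given for `kmerseek search` above.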

@olgabot
Contributor Author

olgabot commented Mar 17, 2025

@heuermh I really like this naming! Thank you so much! I have a minor correction and a question.

  1. extract-kmers produces a single parquet file per input fasta.
  2. Is it useful to specify query vs target (I've also seen against) for the sketch-kmers and extract-kmers subcommands? I think it is a bit confusing, but I'm not sure how to clarify which is being sketched when, e.g. if someone wanted to write out all the commands by hand.

kmerseek index    FASTA --> kmer signature file and kmer parquet file

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file

For sketch-kmers, the visualization is output via stderr, then the csv is sent to either a file or stdout.

kmerseek search    query FASTA, target FASTA, target kmer signature file, target kmer parquet file(s) --> query kmer signature file, query kmer parquet file(s), matches CSV file, vis text file or stdout

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file
  find-matches    query FASTA, query kmer signature file, target FASTA, target kmer signature file --> matches CSV file
  visualize-matches    query FASTA, query kmer signature file, query kmer parquet file, target FASTA, target kmer signature file, target kmer parquet file, matches CSV file --> vis text file, or stdout

Example visualization:

Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q12982|BNIP2_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 2 OS=Homo sapiens OX=9606 GN=BNIP2 PE=1 SV=1
query: RLDIEGFVVDYFTHRILFVYTSLFIKTRIRNN (76-108)
alpha: phphphhhhphhppphhhhhpphhhppphppp
match: SIEADILAITGPEDQPLLAVTRPFISSKFSQK (23-55)
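
The alpha row appears to be both sequences re-encoded in a two-letter hydrophobic (h) / polar (p) alphabet, which is presumably what "reduced amino acid k-mers" refers to. The sketch below reconstructs that mapping from the residues visible in this example; `hp_encode` is a hypothetical helper, not kmerseek's API, and the assignments for M, W, and C are an assumption because those residues do not occur in the example.

```python
# Residues that occur in the example above, grouped by the letter they map to.
HYDROPHOBIC = set("AFGILPVY")
POLAR = set("DEHKNQRST")

# Assumption: residues absent from the example follow the usual
# hydrophobic/polar split (M and W hydrophobic, C polar).
HYDROPHOBIC |= set("MW")
POLAR |= set("C")


def hp_encode(sequence):
    """Translate an amino acid sequence into the h/p reduced alphabet."""
    encoded = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            encoded.append("h")
        elif residue in POLAR:
            encoded.append("p")
        else:
            raise ValueError(f"no h/p assignment for residue {residue!r}")
    return "".join(encoded)


if __name__ == "__main__":
    query = "RLDIEGFVVDYFTHRILFVYTSLFIKTRIRNN"
    match = "SIEADILAITGPEDQPLLAVTRPFISSKFSQK"
    print(hp_encode(query))
    print(hp_encode(match))
```

Both the query and the match segment encode to the string shown in the alpha row, even though the raw sequences share few residues.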

@heuermh
Contributor

heuermh commented Mar 17, 2025

  1. extract-kmers produces a single parquet file per input fasta.

Thanks for the clarification!

Note some tools may write out partitioned Parquet format, as in one or more partition files in a directory. E.g. duckdb does this when you specify PER_THREAD_OUTPUT, spark does this by default, etc.
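
If downstream steps should accept either layout, one option is to normalize the path before reading. This is a rough sketch using only the standard library; `parquet_parts` is a hypothetical helper, and it assumes partition files carry a `.parquet` suffix.

```python
from pathlib import Path


def parquet_parts(path):
    """Return the Parquet file(s) behind `path` as a sorted list.

    A directory is treated as a partitioned dataset (one file per partition);
    anything else is treated as a single Parquet file.
    """
    path = Path(path)
    if path.is_dir():
        # Assumption: partition files end in ".parquet"; other files
        # in the directory (e.g. marker files) are skipped.
        return sorted(path.glob("*.parquet"))
    return [path]
```

A caller could then loop over `parquet_parts(...)` and read each file with whatever Parquet library the project already uses.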

  2. Is it useful to specify query vs target (I've also seen against) for the sketch-kmers and extract-kmers subcommands?

I presume under the hood those two subcommands are exactly the same? The query vs target distinction is more something to be handled by the caller, I think.
