Separate "index" and "search" into substeps for nextflow pipeline #10

olgabot · 2025-02-25T19:14:18Z

Addresses #8 and #7. Having the commands wrapped is nice, but some of the steps currently take a long time (e.g. #9 has been running for ~7 days, boo). Separating them out will make writing Nextflow pipelines easier, because each step can then be its own Nextflow process in https://github.com/seanome/nf-core-kmerseek/

notion-workspace · 2025-02-27T16:55:11Z

→ Task

notion-workspace · 2025-02-27T16:55:59Z

Separate out kmerseek “index” and “search” steps into substeps for Nextflow processes

olgabot · 2025-02-28T17:38:21Z

Here's what the CLI looks like so far. Is it clear that index wraps all the index-* commands?

(kmerseek-dev) 
 Fri 28 Feb - 09:34  ~/code/kmerseek   origin ☊ olgabot/separate-index-search-substeps ↑2 1☀ 2● ‒3± ⚑ 
  kmerseek --help
Usage: kmerseek [OPTIONS] COMMAND [ARGS]...

  Kmerseek performs efficient protein domain annotation search with reduced
  amino acid k-mers

Options:
  --help  Show this message and exit.

Commands:
  index                  Prepare a database index for searching with k-mers
  index-create-kmers-pq  Substep of index: Extract k-mer sequences and...
  index-create-rocksdb   Substep of index: Creates RocksDB index for fast...
  index-create-sketch    Substep of index: low memory, parallelized
  search                 Search for k-mers in target sequences.

My thoughts:

"Substep of index" takes up a lot of characters and pushes the rest of the help text out of view, so it would be nice to shorten it here.
It'd be nice for the index-create-* substeps to be indented under index for clarity.

olgabot · 2025-02-28T17:39:38Z

Or maybe index-create-* could be subcommands of index and running index alone invokes them all. That way, kmerseek --help would only show index and search which feels cleaner to me.
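
To make the idea concrete, here is a minimal sketch of index as a command with optional substeps, where running index alone invokes them all. It uses the standard-library argparse rather than kmerseek's actual CLI framework, and the substep bodies, the run order, and the `main` and `build_parser` names are assumptions for illustration only.

```python
import argparse


# Placeholder substeps: each only reports what it would do. The names mirror the
# index-create-* commands in the help output; the bodies are hypothetical.
def create_sketch(fasta):
    return f"sketch {fasta}"


def create_kmers_pq(fasta):
    return f"kmers-pq {fasta}"


def create_rocksdb(fasta):
    return f"rocksdb {fasta}"


# Insertion order is the order a bare `index` runs the substeps in (assumed).
INDEX_SUBSTEPS = {
    "create-sketch": create_sketch,
    "create-kmers-pq": create_kmers_pq,
    "create-rocksdb": create_rocksdb,
}


def build_parser():
    parser = argparse.ArgumentParser(prog="kmerseek")
    commands = parser.add_subparsers(dest="command", required=True)

    index = commands.add_parser(
        "index", help="Prepare a database index for searching with k-mers"
    )
    index.add_argument("fasta")
    # The substep is optional, so `kmerseek index db.fasta` stays valid.
    substeps = index.add_subparsers(dest="substep")
    for name in INDEX_SUBSTEPS:
        substeps.add_parser(name)

    commands.add_parser("search", help="Search for k-mers in target sequences.")
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    if args.command != "index":
        return []
    # No substep given: run every substep in order. Otherwise run only that one.
    names = [args.substep] if args.substep else list(INDEX_SUBSTEPS)
    return [INDEX_SUBSTEPS[name](args.fasta) for name in names]
```

With this layout the top-level help lists only index and search, `main(["index", "db.fasta"])` runs all three substeps, and `main(["index", "db.fasta", "create-rocksdb"])` runs just one. Note that in this sketch the FASTA path comes before the optional substep name.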

olgabot · 2025-03-05T19:26:55Z

Here's what the command line looks like so far:

(kmerseek-dev) 
 Wed  5 Mar - 11:25  ~/code/kmerseek   origin ☊ olgabot/separate-index-search-substeps ✔ 2☀ 
  kmerseek --help
Usage: kmerseek [OPTIONS] COMMAND [ARGS]...

  Kmerseek performs efficient protein domain annotation search with reduced
  amino acid k-mers

Options:
  --help  Show this message and exit.

Commands:
  index                           Prepare a database index for searching...
  index-01-create-sketch          Substep of Index: low memory,...
  index-02-create-kmers-pq        Substep of Index: Extract k-mer...
  index-03-create-rocksdb         Substep of Index: Creates RocksDB...
  search                          Search for k-mers in target sequences.
  search-01-create-query-sketch   Substep of Search: low memory,...
  search-02-create-query-kmers-pq
                                  Substep of Search: Extract k-mer...
  search-03-do-search             Substep of Search: Perform Sourmash...
  search-04-show-results          Substep of Search: Visualize the...

heuermh · 2025-03-14T00:02:48Z

Gave this some more thought.

Leaving out rocksdb for a sec, this is what I see

kmerseek

Commands:
  index  sketch target kmers, extract target kmers
  search  sketch source kmers, extract source kmers, find matches, and visualize matches

Then for kmerseek index

kmerseek index    FASTA --> kmer signature file and kmer parquet file(s)

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file(s)

and for kmerseek search

kmerseek search    query FASTA, target FASTA, target kmer signature file, target kmer parquet file(s) --> query kmer signature file, query kmer parquet file(s), matches CSV file, vis text file or stdout

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file(s)
  find-matches    query FASTA, query kmer signature file, target FASTA, target kmer signature file --> matches CSV file
  visualize-matches    query FASTA, query kmer signature file, query kmer parquet file(s), target FASTA, target kmer signature file, target kmer parquet file(s), matches CSV file --> vis text file, or stdout

Am I missing any input or output files here?
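
One way to check a listing like this for gaps is to write it down as data and verify that every substep input is either supplied to search or produced by an earlier substep. The sketch below does that for the search listing; it assumes the sketch-kmers and extract-kmers subcommands under search operate on the query FASTA, and the `check_substeps` helper is hypothetical.

```python
# Inputs and outputs per search substep, transcribed from the listing above.
# Assumption: under `search`, sketch-kmers and extract-kmers act on the query FASTA.
SEARCH_SUBSTEPS = {
    "sketch-kmers": ({"query FASTA"}, {"query kmer signature file"}),
    "extract-kmers": ({"query FASTA"}, {"query kmer parquet file(s)"}),
    "find-matches": (
        {
            "query FASTA",
            "query kmer signature file",
            "target FASTA",
            "target kmer signature file",
        },
        {"matches CSV file"},
    ),
    "visualize-matches": (
        {
            "query FASTA",
            "query kmer signature file",
            "query kmer parquet file(s)",
            "target FASTA",
            "target kmer signature file",
            "target kmer parquet file(s)",
            "matches CSV file",
        },
        {"vis text file or stdout"},
    ),
}

# What the wrapping `kmerseek search` command is given, per the listing.
SEARCH_INPUTS = {
    "query FASTA",
    "target FASTA",
    "target kmer signature file",
    "target kmer parquet file(s)",
}


def check_substeps(substeps, external_inputs):
    """Return (missing inputs per substep, everything the substeps produce)."""
    available = set(external_inputs)
    missing = {}
    for name, (inputs, outputs) in substeps.items():
        lacking = inputs - available
        if lacking:
            missing[name] = lacking
        # Outputs of this substep become available to the ones after it.
        available |= outputs
    return missing, available - set(external_inputs)


missing, produced = check_substeps(SEARCH_SUBSTEPS, SEARCH_INPUTS)
print("missing:", missing)
print("produced:", sorted(produced))
```

Under these assumptions no substep is missing an input, and the files produced match the outputs listed for kmerseek search.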

olgabot · 2025-03-17T16:34:10Z

@heuermh I really like this naming! Thank you so much! I have a minor correction and a question.

extract-kmers produces a single parquet file per input fasta.
Is it useful to specify query vs target (I've also seen against) for the sketch-kmers and extract-kmers subcommands? I think it is a bit confusing, but I'm not sure how to clarify which is being sketched when, e.g. if someone wanted to write out all the commands by hand.

kmerseek index    FASTA --> kmer signature file and kmer parquet file

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file

For sketch-kmers, the visualization is output via stderr, then the csv is sent to either a file or stdout.

kmerseek search    query FASTA, target FASTA, target kmer signature file, target kmer parquet file(s) --> query kmer signature file, query kmer parquet file(s), matches CSV file, vis text file or stdout

Subcommands:
  sketch-kmers    FASTA --> kmer signature file
  extract-kmers    FASTA --> kmer parquet file
  find-matches    query FASTA, query kmer signature file, target FASTA, target kmer signature file --> matches CSV file
  visualize-matches    query FASTA, query kmer signature file, query kmer parquet file, target FASTA, target kmer signature file, target kmer parquet file, matches CSV file --> vis text file, or stdout

Example visualization:

Query Name: sp|P41958|CED9_CAEEL Apoptosis regulator ced-9 OS=Caenorhabditis elegans OX=6239 GN=ced-9 PE=1 SV=1
Match Name: sp|Q12982|BNIP2_HUMAN BCL2/adenovirus E1B 19 kDa protein-interacting protein 2 OS=Homo sapiens OX=9606 GN=BNIP2 PE=1 SV=1
query: RLDIEGFVVDYFTHRILFVYTSLFIKTRIRNN (76-108)
alpha: phphphhhhphhppphhhhhpphhhppphppp
match: SIEADILAITGPEDQPLLAVTRPFISSKFSQK (23-55)

heuermh · 2025-03-17T16:59:37Z

> extract-kmers produces a single parquet file per input fasta.

Thanks for the clarification!

Note some tools may write out partitioned Parquet format, as in one or more partition files in a directory. E.g. duckdb does this when you specify PER_THREAD_OUTPUT, spark does this by default, etc.

> Is it useful to specify query vs target (I've also seen against) for the sketch-kmers and extract-kmers subcommand?

I presume under the hood those two subcommands are exactly the same? The query vs target distinction is more something to be handled by the caller, I think.
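
A minimal sketch of that arrangement, assuming the two subcommands really are role-agnostic: the functions only see a FASTA path, and the caller decides which result is the query and which is the target. The output suffixes and the `prepare` and `plan_search` helpers are hypothetical, not kmerseek's actual naming.

```python
from pathlib import Path


def sketch_kmers(fasta: str) -> Path:
    # Hypothetical output naming; knows nothing about query vs target.
    return Path(f"{fasta}.sig")


def extract_kmers(fasta: str) -> Path:
    # Hypothetical output naming; one parquet file per input FASTA.
    return Path(f"{fasta}.kmers.pq")


def prepare(fasta: str) -> dict:
    """Run both role-agnostic steps on one FASTA."""
    return {"signature": sketch_kmers(fasta), "kmers": extract_kmers(fasta)}


def plan_search(query_fasta: str, target_fasta: str) -> dict:
    # The caller is the only place where "query" and "target" exist as labels.
    return {"query": prepare(query_fasta), "target": prepare(target_fasta)}


plan = plan_search("query.fasta", "target.fasta")
print(plan["query"]["signature"], plan["target"]["kmers"])
```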

notion-workspace · 2025-03-20T00:07:47Z

Rename substeps of kmerseek from @Michael Heuer’s suggestions


olgabot added 8 commits February 25, 2025 11:10

Separate out indexing steps for use in Nextflow pipelines

8d7e202

Add isort and black profile to pyproject.toml

f11cfa3

Black formatting on search.py

3f68096

Add logger to sig2kmer.py

ddb8428

Merge branch 'main' into olgabot/separate-index-search-substeps

5aff34b

Merged rename sig -> sketch, separating out search to components

7573a29

Merged separating out index substeps

0a5ffa2

rename: sig -> sketch

e6c6fa7

olgabot added 5 commits February 28, 2025 09:26

Black formatting

47fafb9

Get index subcommands working

18ca1ff

remove manyketch test files

7ce0ec7

Add index substeps and documentation"

1db4d4d

Add index substeps to CLI

3ad612a

olgabot added 4 commits March 5, 2025 11:09

Add --debug flag to indexing commands

70bc9aa

Make subcommands for kmerseek search

5a6ad77

rename subcommands by their order

aafe6d0

Update docstrings

6fbc557
