I would like to assign a taxonomic label to my protein sequences using the BLAST NR database and the mmseqs taxonomy command available in the Docker image (quay.io/microbiome-informatics/mmseqs:2.13). I noticed that the default coverage value is zero according to the help page:
-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]
However, if I set the -c parameter to 0.8, which sounds reasonable to me, then all of my sequences are labelled as "no rank unclassified".
I am looking through the code and seeing some bugs in how coverage is handled within the alignment step for taxonomy.
Whether or not using it here makes sense, it's definitely broken code-wise.
It is also not well defined which coverage should be computed, since the 2bLCA procedure performs multiple alignments. What is currently implemented (however broken) tries to compute the coverage between the extracted subfragment of the database sequence and the other database hits.
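To make the ambiguity concrete, here is a minimal Python sketch of coverage as the fraction of aligned residues. It is not MMseqs2 code: the function name, coordinates, and sequence lengths are made up for illustration. It only shows that the same alignment can pass or fail a 0.8 threshold depending on which sequence length is used as the denominator.

```python
def coverage(aln_start: int, aln_end: int, seq_len: int) -> float:
    """Fraction of a sequence covered by an alignment (1-based, inclusive coordinates)."""
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    return (aln_end - aln_start + 1) / seq_len


# Hypothetical numbers: 120 residues of a database sequence are aligned.
aligned_start, aligned_end = 31, 150

# Denominator 1: the extracted subfragment (assumed here to be exactly the aligned region).
cov_fragment = coverage(1, 120, 120)

# Denominator 2: the full-length database sequence (assumed to be 400 residues long).
cov_full = coverage(aligned_start, aligned_end, 400)

threshold = 0.8  # the -c value from the question
print(f"fragment: {cov_fragment:.2f} passes={cov_fragment >= threshold}")
print(f"full:     {cov_full:.2f} passes={cov_full >= threshold}")
```

With these assumed numbers the subfragment is fully covered while the full-length sequence is not, so the outcome of a -c filter depends entirely on which definition is applied.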
You need to pass --tax-output-mode 2 to also compute and store the alignments. They will be placed at taxresults.database_aln in your case.
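As a rough sketch, the call could be assembled from Python like this. The helper function is hypothetical, and the database names queryDB, targetDB, and tmp are placeholders; only --tax-output-mode 2 and the _aln suffix on the result name come from the comment above.

```python
import shlex


def build_taxonomy_cmd(query_db, target_db, result_db, tmp_dir, tax_output_mode=2):
    """Assemble an `mmseqs taxonomy` command line as an argument list."""
    return [
        "mmseqs", "taxonomy",
        query_db, target_db, result_db, tmp_dir,
        "--tax-output-mode", str(tax_output_mode),
    ]


# Placeholder database names; replace with your own paths.
cmd = build_taxonomy_cmd("queryDB", "targetDB", "taxresults.database", "tmp")

# Per the comment above, the alignments are expected under the result name plus "_aln".
aln_db = "taxresults.database" + "_aln"

print(shlex.join(cmd))
print(aln_db)
```

The list form can be handed to subprocess.run(cmd, check=True) when mmseqs is installed; it is only printed here so the sketch runs without MMseqs2.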
easy-taxonomy and taxonomy behave the same; the only difference is that the former takes FASTA input while the latter only takes MMseqs2 databases.
The main algorithmic difference depends on the input type, though. With nucleotide input it will use the contig taxonomy procedure described in the MMseqs2 taxonomy paper, which includes the fast ORF-prefiltering and the taxonomy majority voting.
The ORF-prefiltering can be overaggressive for short reads. Our previous recommendation was to disable it with --orf-filter 0 if you give it short-read input. We are currently developing a better fix in #832 that should not require messing with this parameter.
For protein input, the ORF-prefiltering and majority voting do not happen.
Full command: [...]
My questions are:
- [...] -c parameter to reduce spurious hits?
- [...] lca-mode or a different one?