MMSeqs taxonomy coverage value #750

Open
pbelmann opened this issue Sep 13, 2023 · 1 comment

@pbelmann

I would like to assign taxonomic labels to my protein sequences using the BLAST NR database and the mmseqs taxonomy command from the Docker image (quay.io/microbiome-informatics/mmseqs:2.13). According to the help page, the default coverage value is zero:
-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

However, if I set the -c parameter to 0.8, which sounds reasonable to me, then all of my sequences are labelled as no rank unclassified.

Full command:

mmseqs taxonomy queryDB ${MMSEQS2_DATABASE_DIR} taxresults.database tmp --lca-ranks superkingdom,phylum,class,order,family,genus,species,subspecies --threads 28

My questions are:

  1. Doesn't it always make sense to increase the -c parameter to reduce spurious hits?
  2. How can I inspect the alignment of the best hit?
  3. Does easy-taxonomy also use the same default lca-mode or a different one?
@milot-mirdita
Member

  1. I am looking through the code and see some bugs in how coverage is handled within the alignment step of taxonomy.
    Regardless of whether raising -c makes sense or not, it is definitely broken code-wise.

It is also not well defined which coverage should be computed, since the 2bLCA procedure performs multiple alignments. What is currently implemented (however broken) tries to compute the coverage between the extracted subfragment of the database hit and the other database hits.

https://github.com/soedinglab/MMseqs2/wiki#the-concept-of-lca In the figure there, this would be the coverage of the pink Hit 1 fragment versus Hits 2, 3 and 4. I am not sure which coverage would make the most sense to compute, and in any case changing it would require us to run new benchmarks.
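As a reminder of what a -c threshold means for an ordinary pairwise alignment, here is a minimal sketch. It assumes coverage is the aligned span divided by the sequence length and that the default --cov-mode 0 applies the threshold to both query and target. It does not reproduce the fragment-versus-hit computation inside 2bLCA described above. The function name coverage_table, the column order and the sample row are made up for illustration.

```shell
# Sketch only: coverage of a plain pairwise alignment, not the 2bLCA internals.
# Input columns (tab-separated, 1-based positions, hypothetical layout):
#   query  target  qstart  qend  qlen  tstart  tend  tlen
coverage_table() {
  awk -F'\t' -v min_cov="${1:-0.8}" '{
    qcov = ($4 - $3 + 1) / $5    # fraction of query residues covered
    tcov = ($7 - $6 + 1) / $8    # fraction of target residues covered
    # assumed --cov-mode 0 behaviour: both sides must reach the threshold
    verdict = (qcov >= min_cov && tcov >= min_cov) ? "keep" : "drop"
    printf "%s\t%s\t%.2f\t%.2f\t%s\n", $1, $2, qcov, tcov, verdict
  }'
}

# Made-up example row: 90 of 100 residues aligned on both sides.
printf 'q1\tt1\t1\t90\t100\t11\t100\t100\n' | coverage_table 0.8
```

A hit that covers only half of the query would be dropped at 0.8 regardless of how good the aligned region is, which is one reason a high threshold can remove every candidate.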

  2. You need to pass --tax-output-mode 2 to also compute and store the alignments. They will be placed at taxresults.database_aln in your case.

  3. easy-taxonomy and taxonomy behave the same; the only difference is that the former takes FASTA input while the latter only takes MMseqs2 databases.

The main algorithmic difference depends on the input type, though. With nucleotide input, it uses the contig taxonomy procedure described in the MMseqs2 taxonomy paper, which includes the fast ORF prefiltering and the taxonomy majority voting.

The ORF prefiltering can be overly aggressive for short reads. Our previous recommendation was to disable it with --orf-filter 0 if you give it short-read input. We are currently developing a better fix in #832 that should not require changing this parameter.
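The short-read workaround could look roughly like the sketch below. The wrapper name classify_short_reads and all file names are hypothetical; only easy-taxonomy and --orf-filter 0 come from the discussion above.

```shell
# Sketch only: nucleotide short reads with ORF prefiltering switched off.
classify_short_reads() {
  reads_fasta=$1   # FASTA file with short nucleotide reads (hypothetical name)
  target_db=$2     # MMseqs2 target database with taxonomy information
  out_prefix=$3    # prefix for the result files
  tmp_dir=$4       # scratch directory
  mmseqs easy-taxonomy "$reads_fasta" "$target_db" "$out_prefix" "$tmp_dir" \
    --orf-filter 0
}

# Usage (hypothetical paths):
#   classify_short_reads reads.fasta "${MMSEQS2_DATABASE_DIR}" reads_tax tmp
```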

For protein input, neither the ORF filtering nor the majority voting takes place.
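For question 2 (inspecting the alignment of the best hit), one possible workflow is sketched below. It re-runs taxonomy with --tax-output-mode 2 and then converts the stored alignments to a readable table with convertalis. This assumes that the _aln result is a regular alignment database that convertalis accepts, which has not been verified here. The function name and the output file name are hypothetical.

```shell
# Sketch only: keep the alignments from taxonomy and dump them as a table.
inspect_taxonomy_alignments() {
  query_db=$1    # e.g. queryDB
  target_db=$2   # e.g. "${MMSEQS2_DATABASE_DIR}"
  result_db=$3   # e.g. taxresults.database
  tmp_dir=$4     # e.g. tmp
  out_tsv=$5     # hypothetical output file, e.g. taxresults_aln.tsv

  # Mode 2 stores the alignments next to the LCA result, at <result_db>_aln.
  mmseqs taxonomy "$query_db" "$target_db" "$result_db" "$tmp_dir" \
    --tax-output-mode 2 &&
  # Assumption: the _aln database can be read like any alignment result.
  mmseqs convertalis "$query_db" "$target_db" "${result_db}_aln" "$out_tsv" \
    --format-output "query,target,pident,evalue,qstart,qend,qlen,tstart,tend,tlen"
}

# Usage (names taken from the original command, output name made up):
#   inspect_taxonomy_alignments queryDB "${MMSEQS2_DATABASE_DIR}" \
#     taxresults.database tmp taxresults_aln.tsv
```

The start, end and length columns make it possible to see how much of each query the best hit actually covers before choosing a -c value.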
