|
1 |
| -# OMAmer |
| 1 | +# OMAmer - tree-driven and alignment-free protein assignment to subfamilies |
2 | 2 |
|
3 |
| -OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionnary-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND). |
| 3 | +OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionary-informed _k_-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND). |
4 | 4 |
|
5 | 5 | # Installation
|
6 |
| -Requires Python >= 3.6. Download the package from the PyPI, resolving the dependencies by using ``pip install omamer``. |
| 6 | +Requires Python >= 3.8. Install the package from PyPI, together with its dependencies, using ``pip install omamer``. |
7 | 7 |
|
8 | 8 | Alternatively, clone this repository and install manually.
|
9 | 9 |
|
| 10 | +Note: Python 3.12 is currently not supported, until the `numba` package is updated ([issue](https://github.com/numba/numba/issues/9197)). |
| 11 | + |
10 | 12 | # Pre-Built Databases
|
11 | 13 |
|
12 | 14 | Pre-built databases are available for the latest OMA release from the [download section on the OMA Browser website](https://omabrowser.org/oma/current).
|
13 | 15 |
|
14 | 16 | - LUCA: https://omabrowser.org/All/LUCA.h5
|
15 |
| - - Metazoa: https://omabrowser.org/All/Metazoa.h5 |
| 17 | + - _Metazoa_: https://omabrowser.org/All/Metazoa.h5 |
| 18 | + - _Viridiplantae_: https://omabrowser.org/All/Viridiplantae.h5 |
| 19 | + - _Saccharomyceta_: https://omabrowser.org/All/Saccharomyceta.h5 |
| 20 | + - _Primates_: https://omabrowser.org/All/Primates.h5 |
16 | 21 |
|
17 | 22 | Their names indicate the root-taxon parameter used. Other non-required parameters were left to default.
|
18 | 23 |
|
19 | 24 | Note: databases included in the [Zenodo upload](https://zenodo.org/record/4593702) from the manuscript are not supported by the most recent version of OMAmer. We recommend using the most recent release with databases built on the most recent OMA browser release.
|
20 | 25 |
|
| 26 | + |
| 27 | + |
| 28 | +# omamer search - Searching a Database |
| 29 | +Assign proteins to families and subfamilies in a pre-existing database. |
| 30 | +## Usage |
| 31 | +Required arguments: ``--db``, ``--query`` |
| 32 | + |
| 33 | + usage: omamer search [-h] -d DB -q QUERY [--threshold THRESHOLD] [--family_alpha FAMILY_ALPHA] [-fo] [-n TOP_N_FAMS] [--reference_taxon REFERENCE_TAXON] |
| 34 | + [-o OUT] [--include_extant_genes] [-c CHUNKSIZE] [-t {0,1,2,3,4,5,6,7,8}] [--log_level {debug,info,warning}] [--silent] |
| 35 | + |
| 36 | +## Arguments |
| 37 | +### Quick reference table |
| 38 | + |
| 39 | +| Short Flag | Flag | Default | Description | |
| 40 | +|:-----------|:---------------------|:-----------------------|:------------| |
| 41 | +| [``-d``](#markdown-header-d) | [``--db``](#markdown-header--db) || Path to existing database (including filename) |
| 42 | +| [``-q``](#markdown-header-q) | [``--query``](#markdown-header--query) || Path to FASTA formatted sequences |
| 43 | +| | [``--threshold``](#markdown-header--threshold) | 0.1 | Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the threshold, the more (over-)specific the predicted HOGs will be. |
| 44 | +| | [``--family_alpha``](#markdown-header--family_alpha) | 1e-6 | Significance threshold used when filtering families. |
| 45 | +| [``-fo``](#markdown-header-fo) | [``--family_only``](#markdown-header--family_only) | False | If set, only place at the family level. Useful for certain analyses. Note: `subfamily_medianseqlen` in the results is for the family level. |
| 46 | +| [``-n``](#markdown-header-n) | [``--top_n_fams``](#markdown-header--top_n_fams) | 1 | Number of top-level families to place into. By default, queries are placed into only the best-scoring family. |
| 47 | +<!--| | [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs.--> |
| 48 | +| [``-o``](#markdown-header-o) | [``--out``](#markdown-header--db) | stdout | Path to output. If not set, defaults to stdout. |
| 49 | +| | [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results |
| 50 | +| [``-c``](#markdown-header-c) | [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once. |
| 51 | +| [``-t``](#markdown-header-t) | [``--nthreads``](#markdown-header--db) |1|Number of threads to use |
| 52 | +| | [``--log_level``](#markdown-header--db) |info| Logging level (options debug, info, warning) |
| 53 | +| | [``--silent``](#markdown-header--silent) || Set to silence the output. |
| 54 | + |
| 55 | +# Output |
| 56 | + |
| 57 | +Output is in the form of a tab-separated value (TSV) file, with metadata added to the header using ``!<tag>: <value>``. A parser can be imported for further analysis in Python as ``from omamer.results_reader import results_reader``. |
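As a rough illustration of the layout described above, the following standard-library sketch separates the ``!<tag>: <value>`` metadata lines from the tabular rows. It is not the `results_reader` parser shipped with OMAmer; the function name `read_omamer_tsv` and the `omamer-version` tag in the sample are hypothetical, and it assumes that the first non-metadata line holds the column names.

```python
import csv
import io


def read_omamer_tsv(handle):
    """Split an OMAmer-style TSV into (metadata dict, list of row dicts)."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("!"):
            # Metadata line of the form "!<tag>: <value>".
            tag, _, value = line[1:].rstrip("\n").partition(":")
            metadata[tag.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    # Assumption: the first remaining line holds the column names.
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return metadata, rows


# Hypothetical sample; real output has more columns and metadata tags.
example = io.StringIO(
    "!omamer-version: 2.0.0\n"
    "qseqid\thogid\tqseqlen\n"
    "query1\tHOG:0487954.3l.27l\t312\n"
)
metadata, rows = read_omamer_tsv(example)
```

All values are returned as strings, so numeric columns such as `qseqlen` need converting before use.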
| 58 | + |
| 59 | +## Output Columns |
| 60 | + |
| 61 | +#### Query sequence identifier (`qseqid`) |
| 62 | +The sequence identifier from the input FASTA-formatted sequences. |
| 63 | + |
| 64 | +#### Predicted HOG identifier (`hogid`) |
| 65 | +The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs). |
| 66 | + |
| 67 | +A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild. |
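The nesting described above can be unpacked with a few lines of Python. This is a minimal sketch based only on the dot-separated structure of the identifier; the helper name `hog_lineage` is hypothetical and not part of OMAmer.

```python
def hog_lineage(hog_id):
    """Return the chain of HOG identifiers from the root-HOG down to hog_id."""
    parts = hog_id.split(".")
    # Each prefix of the dot-separated identifier names one ancestor.
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


lineage = hog_lineage("HOG:0487954.3l.27l")
root_hog = lineage[0]  # the HOG without a parent
```

Here `lineage` lists the root-HOG first, then its child, then the grandchild, matching the example in the text.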
| 68 | + |
| 69 | +#### Predicted HOG taxonomic level (`hoglevel`) |
| 70 | +The taxonomic level that the predicted HOG is defined at. |
| 71 | + |
| 72 | +#### Family p-value (`family_p`) |
| 73 | +p-value of having as many or more _k_-mers in common under a binomial distribution. Reported in negative natural log units. |
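To make the units concrete, here is a small sketch of a binomial tail probability expressed as a negative natural logarithm. It only illustrates the general formula; how OMAmer estimates the per-family hit probability is not described here, so `p_hit`, `n_kmers`, and the function name are assumptions chosen for illustration.

```python
import math


def neg_log_binomial_tail(n_kmers, n_shared, p_hit):
    """Return -ln P(X >= n_shared) for X ~ Binomial(n_kmers, p_hit)."""
    # Sum the probability mass of observing n_shared or more hits.
    tail = sum(
        math.comb(n_kmers, i) * p_hit**i * (1.0 - p_hit) ** (n_kmers - i)
        for i in range(n_shared, n_kmers + 1)
    )
    return -math.log(tail)


# Toy numbers: 10 query k-mers, all 10 shared, hit probability 0.5.
score = neg_log_binomial_tail(10, 10, 0.5)
```

Larger values correspond to smaller p-values. With realistic sequence lengths the direct sum can underflow to zero, so a production implementation would work in log space instead.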
| 74 | + |
| 75 | +#### Family count (`family_count`) |
| 76 | +Count of _k_-mers in common with the family / root level HOG. |
| 77 | + |
| 78 | +#### Family normalised count (`family_normcount`) |
| 79 | +Family count, normalised by the number of hits expected given the query's sequence length and the family's _k_-mer content. |
| 80 | + |
| 81 | +#### Sub-Family score (`subfamily_score`) |
| 82 | +The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs. |
| 83 | + |
| 84 | +#### Sub-Family count (`subfamily_count`) |
| 85 | +Count of _k_-mers in common with the sub-family / HOG. |
| 86 | + |
| 87 | +#### Query sequence length (`qseqlen`) |
| 88 | +Length of the query sequence. |
| 89 | + |
| 90 | +#### Sub-Family median sequence length (`subfamily_medianseqlen`) |
| 91 | +Median length of the sequences that are present in the predicted HOG. In the case of family-only placement, this is instead reported at the root-HOG level. |
| 92 | + |
| 93 | +#### Query sequence overlap (`qseq_overlap`) |
| 94 | +The proportion of the query sequence overlapping with _k_-mers of reference root-HOGs. This may be helpful to reject partially homologous matches that are problematic in some applications. |
| 95 | + |
| 96 | +#### Sub-Family gene set (`subfamily_geneset`) |
| 97 | +Optionally printed (see ``--include_extant_genes``). Comma-separated list of the extant gene IDs in the predicted HOG. More information on these genes is available from the [OMA browser](https://omabrowser.org), in particular through its [REST API](https://omabrowser.org/api/docs) or the [Python API Client](https://github.com/DessimozLab/pyomadb). |
| 98 | + |
| 99 | +<!-- #### Closest taxon from reference taxon (`closest_taxa`) |
| 100 | +The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available. |
| 101 | +--> |
| 102 | + |
| 103 | + |
21 | 104 | # omamer mkdb - Building a Database
|
22 | 105 | This currently relies on the OMA browser's database file and the species phylogeny of HOGs. Support for building from OrthoXML files will be available shortly.
|
23 | 106 | - https://omabrowser.org/All/OmaServer.h5
|
@@ -45,57 +128,14 @@ Required arguments: ``--db``, ``--oma_path``
|
45 | 128 | | [``--oma_path``](#markdown-header--oma_path)||Path to a directory with both OmaServer.h5 and speciestree.nwk
|
46 | 129 | | [``--log_level``](#markdown-header--log_level)|info|Logging level
|
47 | 130 |
|
48 |
| -# omamer search - Searching a Database |
49 |
| -Assign proteins to families and subfamilies in a pre-existing database. |
50 |
| -## Usage |
51 |
| -Required arguments: ``--db``, ``--query`` |
52 |
| - |
53 |
| - usage: omamer search [-h] --db DB --query QUERY [--score {default,sensitive}] [--threshold THRESHOLD] [--reference_taxon REFERENCE_TAXON] [--out OUT] |
54 |
| - [--include_extant_genes] [--chunksize CHUNKSIZE] [--nthreads NTHREADS] [--log_level {debug,info,warning}] |
55 |
| - |
56 |
| -## Arguments |
57 |
| -### Quick reference table |
58 |
| - |
59 |
| -| Flag | Default | Description | |
60 |
| -|:--------------------|:----------------------|:-----------| |
61 |
| -| [``--db``](#markdown-header--db) || Path to existing database (including filename) |
62 |
| -| [``--query``](#markdown-header--query) || Path to FASTA formatted sequences |
63 |
| -| [``--score``](#markdown-header--score) |default| Type of OMAmer-score to use. Options are "default" and "sensitive". |
64 |
| -| [``--threshold``](#markdown-header--threshold) |0.05| Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the theshold the more (over-)specific predicted HOGs will be. |
65 |
| -| [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs. |
66 |
| -| [``--out``](#markdown-header--db) |stdout| Path to output (default stdout) |
67 |
| -| [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results |
68 |
| -| [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once. |
69 |
| -| [``--nthreads``](#markdown-header--db) |1|Number of threads to use |
70 |
| -| [``--log_level``](#markdown-header--db) |info| Logging level |
71 |
| - |
72 |
| -# Output columns |
73 |
| - |
74 |
| -#### Query sequence identifier |
75 |
| -The sequence identifier from the input fasta |
76 |
| - |
77 |
| -#### Predicted HOG identifier |
78 |
| -The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs). |
79 |
| - |
80 |
| -A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild. |
81 |
| - |
82 |
| -#### Closest taxon from reference taxon |
83 |
| -The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available. |
84 |
| - |
85 |
| -#### Overlap-score |
86 |
| -The fraction of the query sequence overlapping with k-mers of reference root-HOGs. This score aims to help reject partial homologous matches that are problematic in some applications. |
87 |
| - |
88 |
| -#### Family-level OMAmer-score |
89 |
| -The OMAmer-score of the predicted root-HOG. At the family level, this score measures the sequence similarity between the query and a given root-HOG. |
90 |
| - |
91 |
| -#### Subfamily-level OMAmer-score |
92 |
| -The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs. |
93 |
| - |
94 |
| -#### Subfamily gene set |
95 |
| -Extant gene IDs of predicted HOG, which you can look for in the OMA browser search bar or its REST API (https://omabrowser.org/api/docs). |
96 | 131 |
|
97 | 132 | # Change log
|
98 | 133 |
|
| 134 | +#### Version 2.0.0 |
| 135 | + - Major update of database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM. |
| 136 | + - Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query. |
| 137 | + - UX improvements - more feedback during interactive search runs, whilst maintaining small log files. |
| 138 | + |
99 | 139 | #### Version 0.2.5
|
100 | 140 | - Fixes an issue when storing the pre-computed statistics
|
101 | 141 |
|
@@ -134,5 +174,3 @@ You should have received a copy of the GNU Lesser General Public License along w
|
134 | 174 | Victor Rossier, Alex Warwick Vesztrocy, Marc Robinson-Rechavi, Christophe Dessimoz, OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches, Bioinformatics, 2021;, btab219, https://doi.org/10.1093/bioinformatics/btab219
|
135 | 175 |
|
136 | 176 | Code used for that paper is available here: https://doi.org/10.5281/zenodo.4593702
|
137 |
| - |
138 |
| - |
|