Commit 2c295d2

Updating readme for 2.0.0
1 parent d8cc567 commit 2c295d2

1 file changed

+92
-54
lines changed

README.md

@@ -1,23 +1,106 @@
1-
# OMAmer
1+
# OMAmer - tree-driven and alignment-free protein assignment to subfamilies
22

3-
OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionnary-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND).
3+
OMAmer is a novel alignment-free protein family assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. It is based on an innovative method using evolutionary-informed _k_-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, it has provided better and quicker subfamily-level assignments than a method based on closest sequences (using DIAMOND).
44

55
# Installation
6-
Requires Python >= 3.6. Download the package from the PyPI, resolving the dependencies by using ``pip install omamer``.
6+
Requires Python >= 3.8. Download the package from the PyPI, resolving the dependencies by using ``pip install omamer``.
77

88
Alternatively, clone this repository and install manually.
99

10+
Note: Python 3.12 is currently not supported, until the `numba` package is updated ([issue](https://github.com/numba/numba/issues/9197)).
11+
1012
# Pre-Built Databases
1113

1214
Pre-built databases are available for the latest OMA release from the [download section on the OMA Browser website](https://omabrowser.org/oma/current).
1315

1416
- LUCA: https://omabrowser.org/All/LUCA.h5
15-
- Metazoa: https://omabrowser.org/All/Metazoa.h5
17+
- _Metazoa_: https://omabrowser.org/All/Metazoa.h5
18+
- _Viridiplantae_: https://omabrowser.org/All/Viridiplantae.h5
19+
- _Saccharomyceta_: https://omabrowser.org/All/Saccharomyceta.h5
20+
- _Primates_: https://omabrowser.org/All/Primates.h5
1621

1722
Their names indicate the root-taxon parameter used. Other non-required parameters were left to default.
1823

1924
Note: databases included in the [Zenodo upload](https://zenodo.org/record/4593702) from the manuscript are not supported by the most recent version of OMAmer. We recommend using the most recent release with databases built on the most recent OMA browser release.
2025

26+
27+
28+
# omamer search - Searching a Database
29+
Assign proteins to families and subfamilies in a pre-existing database.
30+
## Usage
31+
Required arguments: ``--db``, ``--query``
32+
33+
usage: omamer search [-h] -d DB -q QUERY [--threshold THRESHOLD] [--family_alpha FAMILY_ALPHA] [-fo] [-n TOP_N_FAMS] [--reference_taxon REFERENCE_TAXON]
34+
[-o OUT] [--include_extant_genes] [-c CHUNKSIZE] [-t {0,1,2,3,4,5,6,7,8}] [--log_level {debug,info,warning}] [--silent]
35+
36+
## Arguments
37+
### Quick reference table
38+
39+
| Short Flag | Flag | Default | Description |
40+
|:-----------|:---------------------|:-----------------------|:------------|
41+
| [``-d``](#markdown-header-d) | [``--db``](#markdown-header--db) || Path to existing database (including filename)
42+
| [``-q``](#markdown-header-q) | [``--query``](#markdown-header--query) || Path to FASTA formatted sequences
43+
| | [``--threshold``](#markdown-header--threshold) | 0.1 | Threshold applied to the OMAmer-score, used to vary the specificity of predicted HOGs. The lower the threshold, the more (over-)specific the predicted HOGs will be.
44+
| | [``--family_alpha``](#markdown-header--family_alpha) | 1e-6 | Significance threshold used when filtering families.
45+
| [``-fo``](#markdown-header-fo) | [``--family_only``](#markdown-header--family_only) | False | If set, only place at the family level. Useful for certain analyses. Note: `subfamily_medianseqlen` in the results is then reported for the family level.
46+
| [``-n``](#markdown-header-n) | [``--top_n_fams``](#markdown-header--top_n_fams) | 1 | Number of top-level families to place into. By default, queries are placed into only the best-scoring family.
47+
<!--| | [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs.-->
48+
| [``-o``](#markdown-header-o) | [``--out``](#markdown-header--db) | stdout | Path to output. If not set, defaults to stdout.
49+
| | [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results
50+
| [``-c``](#markdown-header-c) | [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once.
51+
| [``-t``](#markdown-header-t) | [``--nthreads``](#markdown-header--db) |1|Number of threads to use
52+
| | [``--log_level``](#markdown-header--db) |info| Logging level (options debug, info, warning)
53+
| | [``--silent``](#markdown-header--silent) || Set to silence the output.
54+
55+
# Output
56+
57+
Output is a tab-separated value (TSV) file, with metadata added to the header using ``!<tag>: <value>``. A parser can be imported for further analysis in Python as ``from omamer.results_reader import results_reader``.
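
For readers who want to inspect a results file without the bundled parser, the header convention described above can be handled with the standard library alone. The following is a minimal sketch, not the `results_reader` implementation: it assumes that metadata lines start with `!`, that the first remaining line holds the column names, and the `omamer-version` tag in the example data is hypothetical.

```python
import csv
import io


def read_metadata_tsv(handle):
    """Split a metadata-headed TSV into (metadata dict, list of row dicts)."""
    metadata = {}
    table_lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("!"):
            # Metadata line of the form "!<tag>: <value>"
            tag, _, value = line[1:].partition(":")
            metadata[tag.strip()] = value.strip()
        elif line:
            table_lines.append(line)
    # Assumption: the first non-metadata line holds the column names.
    reader = csv.DictReader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    return metadata, list(reader)


# Hypothetical example content; the tag name and values are illustrative only.
example = (
    "!omamer-version: 2.0.0\n"
    "qseqid\thogid\tfamily_p\n"
    "seq1\tHOG:0487954.3l.27l\t25.3\n"
)
metadata, rows = read_metadata_tsv(io.StringIO(example))
```

All values are returned as strings, so numeric columns need converting before any comparison.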
58+
59+
## Output Columns
60+
61+
#### Query sequence identifier (`qseqid`)
62+
The sequence identifier from the input FASTA-formatted sequences.
63+
64+
#### Predicted HOG identifier (`hogid`)
65+
The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).
66+
67+
A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild.
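
The dot-separated structure described above can be unpacked with plain string handling. This is a small illustrative sketch; `hog_lineage` is a hypothetical helper name, not part of OMAmer.

```python
def hog_lineage(hogid):
    """Return the chain of HOG identifiers from the root-HOG down to `hogid`."""
    parts = hogid.split(".")
    # Each prefix of the dot-separated identifier names one ancestor.
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


lineage = hog_lineage("HOG:0487954.3l.27l")
root_hog = lineage[0]
# lineage is ['HOG:0487954', 'HOG:0487954.3l', 'HOG:0487954.3l.27l']
```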
68+
69+
#### Predicted HOG taxonomic level (`hoglevel`)
70+
The taxonomic level that the predicted HOG is defined at.
71+
72+
#### Family p-value (`family_p`)
73+
P-value of having as many or more _k_-mers in common under a binomial distribution. Reported in negative natural log units.
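
To make the units concrete, the sketch below computes a binomial upper-tail probability and converts it to and from negative natural log units. It is only an illustration of the arithmetic: how OMAmer estimates the number of trials `n` and the per-trial match probability `p` for each family is not described here, so both are hypothetical inputs.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def to_neg_ln(p_value):
    """Convert a p-value to negative natural log units."""
    return -math.log(p_value)


def from_neg_ln(value):
    """Convert negative natural log units back to a p-value."""
    return math.exp(-value)


# Hypothetical numbers: 10 matching k-mers out of 10 trials at p = 0.5.
p_value = binomial_tail(10, 10, 0.5)
reported = to_neg_ln(p_value)

# On this scale, larger values mean stronger evidence. A significance
# level of 1e-6 corresponds to roughly 13.8 negative natural log units.
alpha_neg_ln = to_neg_ln(1e-6)
```

Direct summation is adequate for small `n`; for long sequences a survival function from a statistics library would be more numerically robust.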
74+
75+
#### Family count (`family_count`)
76+
Count of _k_-mers in common with the family / root level HOG.
77+
78+
#### Family normalised count (`family_normcount`)
79+
Family count, normalised by the expected number of hits for the query's sequence length, with the family's _k_-mer content.
80+
81+
#### Sub-Family score (`subfamily_score`)
82+
The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs.
83+
84+
#### Sub-Family count (`subfamily_count`)
85+
Count of _k_-mers in common with the sub-family / HOG.
86+
87+
#### Query sequence length (`qseqlen`)
88+
Length of the query sequence.
89+
90+
#### Sub-Family median sequence length (`subfamily_medianseqlen`)
91+
Median length of the sequences that are present in the predicted HOG. In the case of family-only placement, this is instead reported at the root-HOG level.
92+
93+
#### Query sequence overlap (`qseq_overlap`)
94+
The proportion of the query sequence overlapping with _k_-mers of reference root-HOGs. This may be helpful to reject partially homologous matches that are problematic in some applications.
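
One way to act on this column is to drop placements whose overlap falls below a chosen cut-off. The sketch below assumes rows are dictionaries keyed by column name (for example, as produced by `csv.DictReader`); the 0.5 cut-off is an arbitrary illustrative value, not a recommendation from OMAmer.

```python
def filter_by_overlap(rows, min_overlap=0.5):
    """Keep rows whose `qseq_overlap` is numeric and at least `min_overlap`."""
    kept = []
    for row in rows:
        try:
            overlap = float(row["qseq_overlap"])
        except (KeyError, ValueError):
            # Missing or non-numeric overlap: treat the row as unplaced.
            continue
        if overlap >= min_overlap:
            kept.append(row)
    return kept


# Hypothetical rows for illustration.
rows = [
    {"qseqid": "seq1", "qseq_overlap": "0.92"},
    {"qseqid": "seq2", "qseq_overlap": "0.21"},
    {"qseqid": "seq3", "qseq_overlap": "N/A"},
]
kept_ids = [row["qseqid"] for row in filter_by_overlap(rows)]
```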
95+
96+
#### Sub-Family gene set (`subfamily_geneset`)
97+
Optionally printed (see ``--include_extant_genes``). Comma-separated list of extant gene IDs of the predicted HOG. More information on these genes is available from the [OMA browser](https://omabrowser.org), in particular through its [REST API](https://omabrowser.org/api/docs) or the [Python API Client](https://github.com/DessimozLab/pyomadb).
98+
99+
<!-- #### Closest taxon from reference taxon (`closest_taxa`)
100+
The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available.
101+
-->
102+
103+
21104
# omamer mkdb - Building a Database
22105
This currently relies on the OMA browser's database file and the species phylogeny of HOGs. Building from OrthoXML files will be available shortly.
23106
- https://omabrowser.org/All/OmaServer.h5
@@ -45,57 +128,14 @@ Required arguments: ``--db``, ``--oma_path``
45128
| [``--oma_path``](#markdown-header--oma_path)||Path to a directory with both OmaServer.h5 and speciestree.nwk
46129
| [``--log_level``](#markdown-header--log_level)|info|Logging level
47130

48-
# omamer search - Searching a Database
49-
Assign proteins to families and subfamilies in a pre-existing database.
50-
## Usage
51-
Required arguments: ``--db``, ``--query``
52-
53-
usage: omamer search [-h] --db DB --query QUERY [--score {default,sensitive}] [--threshold THRESHOLD] [--reference_taxon REFERENCE_TAXON] [--out OUT]
54-
[--include_extant_genes] [--chunksize CHUNKSIZE] [--nthreads NTHREADS] [--log_level {debug,info,warning}]
55-
56-
## Arguments
57-
### Quick reference table
58-
59-
| Flag | Default | Description |
60-
|:--------------------|:----------------------|:-----------|
61-
| [``--db``](#markdown-header--db) || Path to existing database (including filename)
62-
| [``--query``](#markdown-header--query) || Path to FASTA formatted sequences
63-
| [``--score``](#markdown-header--score) |default| Type of OMAmer-score to use. Options are "default" and "sensitive".
64-
| [``--threshold``](#markdown-header--threshold) |0.05| Threshold applied on the OMAmer-score that is used to vary the specificity of predicted HOGs. The lower the theshold the more (over-)specific predicted HOGs will be.
65-
| [``--reference_taxon``](#markdown-header--reference_taxon) || The placement is stopped when reaching a HOG with the reference taxon (must exist in the OMA database). This is a complementary option to vary the specificity of predicted HOGs.
66-
| [``--out``](#markdown-header--db) |stdout| Path to output (default stdout)
67-
| [``--include_extant_genes``](#markdown-header--include_extant_genes)||Include extant gene IDs as comma separated entry in results
68-
| [``--chunksize``](#markdown-header--chunksize) |10000| Number of queries to process at once.
69-
| [``--nthreads``](#markdown-header--db) |1|Number of threads to use
70-
| [``--log_level``](#markdown-header--db) |info| Logging level
71-
72-
# Output columns
73-
74-
#### Query sequence identifier
75-
The sequence identifier from the input fasta
76-
77-
#### Predicted HOG identifier
78-
The identifier of the hierarchical orthologous group (HOG) in OMA, which you can access through the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).
79-
80-
A HOG identifier is composed of the root-HOG identifier (following “HOG:” and before the first dot), which is followed by its sub-HOGs (before each subsequent dot). For example, for subfamily HOG:0487954.3l.27l, HOG:0487954 is the root-HOG (HOG without-parent), HOG:0487954.3l is its child and HOG:0487954.3l.27l its grandchild.
81-
82-
#### Closest taxon from reference taxon
83-
The taxon from the predicted HOG that is closest from the reference taxon (given one was provided). This option provides a mean to evaluate the performance of OMAmer placement given some knowledge of the query taxonomy is available.
84-
85-
#### Overlap-score
86-
The fraction of the query sequence overlapping with k-mers of reference root-HOGs. This score aims to help reject partial homologous matches that are problematic in some applications.
87-
88-
#### Family-level OMAmer-score
89-
The OMAmer-score of the predicted root-HOG. At the family level, this score measures the sequence similarity between the query and a given root-HOG.
90-
91-
#### Subfamily-level OMAmer-score
92-
The OMAmer-score of the predicted HOG. At the subfamily level, this score captures the excess of similarity that is shared between the query and a given HOG, thus excluding the similarity with regions conserved in more ancestral HOGs.
93-
94-
#### Subfamily gene set
95-
Extant gene IDs of predicted HOG, which you can look for in the OMA browser search bar or its REST API (https://omabrowser.org/api/docs).
96131

97132
# Change log
98133

134+
#### Version 2.0.0
135+
- Major update of the database format and search code to improve overall memory usage. Most standard runs with a LUCA-level database will run on a machine with 16GB RAM.
136+
- Update to the scoring algorithm for root-level HOG / family assignments, to allow for significance testing. This estimates a binomial distribution for each family, so that we can compute the probability of matching at least as many k-mers as we have observed by chance, for each family that has a match to a given query.
137+
- UX improvements - more feedback during interactive search runs, whilst maintaining small log files.
138+
99139
#### Version 0.2.5
100140
- Fixes an issue when storing the pre-computed statistics
101141

@@ -134,5 +174,3 @@ You should have received a copy of the GNU Lesser General Public License along w
134174
Victor Rossier, Alex Warwick Vesztrocy, Marc Robinson-Rechavi, Christophe Dessimoz, OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches, Bioinformatics, 2021, btab219, https://doi.org/10.1093/bioinformatics/btab219
135175

136176
Code used for that paper is available here: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4593702.svg)](https://doi.org/10.5281/zenodo.4593702)
137-
138-
