diff --git a/HISTORY.md b/HISTORY.md
index 5a97c3c..f7b8561 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -1,6 +1,18 @@
History
=======
+1.5.0 (2023-09-20)
+------------------
+
+* Adds support for `pyrodigal-gv` implementing `prodigal-gv` as a gene predictor ([pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) and [prodigal-gv](https://github.com/apcamargo/prodigal-gv)). This can be specified with `-g prodigal-gv`.
+* Thanks to @[althonos](https://github.com/althonos) and @[apcamargo](https://github.com/apcamargo) for making this possible, and to @[asierFernandezP](https://github.com/asierFernandezP) for raising this as an issue in the first place [here](https://github.com/gbouras13/pharokka/issues/290) in #290.
+* Adds checks to determine if your input FASTA has duplicated contig headers from #293 [here](https://github.com/gbouras13/pharokka/issues/293). Thanks @[thauptfeld](https://github.com/thauptfeld) for raising this.
+* `-g prodigal` and `-g prodigal-gv` should be much faster thanks to multithread support added by the inimitable @althonos.
+* Genbank format output will be designated with PHG not VRL (following this issue https://github.com/RyanCook94/inphared/issues/22).
+* The `_length_gc_cds_density.tsv` and `_cds_final_merged_output.tsv` files now contain the translation table/genetic code for each contig (usually 11 but now not always if you use `pyrodigal-gv`).
+* `--skip_mash` flag added to skip finding the closest match for each contig in INPHARED using mash.
+* `--skip_extra_annotations` flag added to skip running tRNA-scanSE, MinCED and Aragorn in case you only want CDS predictions and functional annotations.
+
1.4.1 (2023-09-04)
------------------
diff --git a/README.md b/README.md
index 2969e9c..69ebfd2 100644
--- a/README.md
+++ b/README.md
@@ -23,6 +23,35 @@ Extra special thanks to Ghais Houtak for making Pharokka's logo.
If you are looking for rapid standardised annotation of bacterial genomes, please use [Bakta](https://github.com/oschwengers/bakta). [Prokka](https://github.com/tseemann/prokka), which inspired the creation and naming of `pharokka`, is another good option, but Bakta is [Prokka's worthy successor](https://twitter.com/torstenseemann/status/1565471892840259585).
+# Table of Contents
+
+- [pharokka](#pharokka)
+ - [Fast Phage Annotation Tool](#fast-phage-annotation-tool)
+- [Table of Contents](#table-of-contents)
+- [Quick Start](#quick-start)
+- [Documentation](#documentation)
+- [Paper](#paper)
+- [Pharokka with Galaxy Europe Webserver](#pharokka-with-galaxy-europe-webserver)
+- [Brief Overview](#brief-overview)
+ - [Pharokka v 1.5.0 Update (20 September 2023)](#pharokka-v-150-update-20-september-2023)
+ - [Pharokka v 1.4.0 Update (27 August 2023)](#pharokka-v-140-update-27-august-2023)
+ - [Pharokka v 1.3.0 Update](#pharokka-v-130-update)
+- [Installation](#installation)
+ - [Conda Installation](#conda-installation)
+ - [Pip](#pip)
+ - [Source](#source)
+- [Database Installation](#database-installation)
+- [Beginner Conda Installation](#beginner-conda-installation)
+- [Usage](#usage)
+- [Version Log](#version-log)
+- [System](#system)
+- [Time](#time)
+- [Benchmarking v1.5.0](#benchmarking-v150)
+- [Benchmarking v1.4.0](#benchmarking-v140)
+- [Original Benchmarking (v1.1.0)](#original-benchmarking-v110)
+- [Bugs and Suggestions](#bugs-and-suggestions)
+- [Citation](#citation)
+
# Quick Start
The easiest way to install `pharokka` is via conda:
@@ -65,11 +94,22 @@ So if you can't get `pharokka` to install on your machine for whatever reason or
-`pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) is also available as an alternative. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.
+`pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) implemented with [pyrodigal](https://github.com/althonos/pyrodigal) and [Prodigal-gv](https://github.com/apcamargo/prodigal-gv) implemented with [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.
+
+## Pharokka v 1.5.0 Update (20 September 2023)
+
+* Adds support for `pyrodigal-gv` implementing `prodigal-gv` as a gene predictor for alternate genetic codes ([pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) and [prodigal-gv](https://github.com/apcamargo/prodigal-gv)). This can be specified with `-g prodigal-gv` and is recommended for metagenomic input datasets. Thanks to @[althonos](https://github.com/althonos) and @[apcamargo](https://github.com/apcamargo) for making this possible, and to @[asierFernandezP](https://github.com/asierFernandezP) for raising this as an issue in the first place [here](https://github.com/gbouras13/pharokka/issues/290).
+* `-g prodigal` and `-g prodigal-gv` should be much faster thanks to multithread support added by the inimitable @[althonos](https://github.com/althonos).
+* Adds checks to determine if your input FASTA has duplicated [contig headers](https://github.com/gbouras13/pharokka/issues/293). Thanks @[thauptfeld](https://github.com/thauptfeld) for raising this.
+* Genbank format output will be designated with PHG not VRL.
+* The `_length_gc_cds_density.tsv` and `_cds_final_merged_output.tsv` files now contain the translation table/genetic code for each contig.
+* `--skip_mash` flag added to skip finding the closest match for each contig in INPHARED using mash.
+* `--skip_extra_annotations` flag added to skip running tRNA-scanSE, MinCED and Aragorn in case you only want CDS predictions and functional annotations.
+
## Pharokka v 1.4.0 Update (27 August 2023)
-Pharokka v1.4.0 is a large update implementing:
+`pharokka` v1.4.0 is a large update implementing:
* More sensitive search for PHROGs using Hidden Markov Models (HMMs) using the amazing [PyHMMER](https://github.com/althonos/pyhmmer).
* By default, `pharokka` will now run searches using both MMseqs2 (PHROGs, CARD and VFDB) and HMMs (PHROGs). MMseqs2 was kept for PHROGs as it provides more information than the HMM results (e.g. sequence alignment identities & top hit PHROG protein) if it finds a hit.
@@ -114,33 +154,6 @@ SAOMS1 phage (GenBank: MW460250.1) was isolated and sequenced by: Yerushalmy, O.
Please see [plotting](docs/plotting.md) for details on all plotting parameter options.
-# Table of Contents
-
-- [pharokka](#pharokka)
- - [Fast Phage Annotation Tool](#fast-phage-annotation-tool)
-- [Quick Start](#quick-start)
-- [Documentation](#documentation)
-- [Paper](#paper)
-- [Pharokka with Galaxy Europe Webserver](#pharokka-with-galaxy-europe-webserver)
-- [Brief Overview](#brief-overview)
- - [Pharokka v 1.4.0 Update (27 August 2023)](#pharokka-v-140-update-27-august-2023)
- - [Pharokka v 1.3.0 Update](#pharokka-v-130-update)
-- [Table of Contents](#table-of-contents)
-- [Installation](#installation)
- - [Conda Installation](#conda-installation)
- - [Pip](#pip)
- - [Source](#source)
-- [Database Installation](#database-installation)
-- [Beginner Conda Installation](#beginner-conda-installation)
-- [Usage](#usage)
-- [Version Log](#version-log)
-- [System](#system)
-- [Time](#time)
-- [Original Benchmarking (v1.1.0)](#original-benchmarking-v110)
-- [Benchmarking v1.4.0](#benchmarking-v140)
-- [Bugs and Suggestions](#bugs-and-suggestions)
-- [Citation](#citation)
-
# Installation
@@ -267,9 +280,9 @@ For a full explanation of all arguments, please see [usage](docs/run.md).
pharokka defaults to 1 thread.
```
-usage: pharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s] [-c CODING_TABLE]
- [-e EVALUE] [--fast] [--mmseqs2_only] [--meta_hmm] [--dnaapler] [--custom_hmm CUSTOM_HMM] [--genbank] [--terminase]
- [--terminase_strand TERMINASE_STRAND] [--terminase_start TERMINASE_START] [-V] [--citation]
+usage: pharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s] [-c CODING_TABLE] [-e EVALUE] [--fast] [--mmseqs2_only]
+ [--meta_hmm] [--dnaapler] [--custom_hmm CUSTOM_HMM] [--genbank] [--terminase] [--terminase_strand TERMINASE_STRAND] [--terminase_start TERMINASE_START]
+ [--skip_extra_annotations] [--skip_mash] [-V] [--citation]
pharokka: fast phage annotation program
@@ -289,13 +302,13 @@ options:
-l LOCUSTAG, --locustag LOCUSTAG
User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.
-g GENE_PREDICTOR, --gene_predictor GENE_PREDICTOR
- User specified gene predictor. Use "-g phanotate" or "-g prodigal".
+ User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank".
Defaults to phanotate (not required unless prodigal is desired).
-m, --meta meta mode for metavirome input samples
-s, --split split mode for metavirome samples. -m must also be specified.
Will output separate split FASTA, gff and genbank files for each input contig.
-c CODING_TABLE, --coding_table CODING_TABLE
- translation table for prodigal. Defaults to 11. Experimental only.
+ translation table for prodigal. Defaults to 11.
-e EVALUE, --evalue EVALUE
E-value threshold for MMseqs2 database PHROGs, VFDB and CARD and PyHMMER PHROGs database search. Defaults to 1E-05.
--fast, --hmm_only Runs PyHMMER (HMMs) with PHROGs only, not MMseqs2 with PHROGs, CARD or VFDB.
@@ -307,13 +320,17 @@ options:
--custom_hmm CUSTOM_HMM
Run pharokka with a custom HMM profile database suffixed .h3m.
Please use create this with the create_custom_hmm.py script.
- --genbank Flag denoting that -i/--input is a genbank file instead of the usual FASTA file
+ --genbank Flag denoting that -i/--input is a genbank file instead of the usual FASTA file.
+ The CDS calls in this file will be preserved and re-annotated.
--terminase Runs terminase large subunit re-orientation mode.
Single genome input only and requires --terminase_strand and --terminase_start to be specified.
--terminase_strand TERMINASE_STRAND
Strand of terminase large subunit. Must be "pos" or "neg".
--terminase_start TERMINASE_START
Start coordinate of the terminase large subunit.
+ --skip_extra_annotations
+ Skips tRNAscan-se, MINced and Aragorn.
+ --skip_mash Skips running mash to find the closest match for each contig in INPHARED.
-V, --version Print pharokka Version
--citation Print pharokka Citation
```
@@ -332,6 +349,53 @@ On a standard 16GB RAM laptop specifying 8 threads, `pharokka` should take betwe
In `--fast` mode, it should take 45-75 seconds.
+# Benchmarking v1.5.0
+
+`pharokka v1.5.0` was run on the 673 crAss phage dataset to showcase the improved CDS prediction of `-g prodigal-gv` for metagenomic datasets where some phages likely have alternative genetic codes (i.e. not 11).
+
+All benchmarking was conducted on a Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 8 threads (`-t 8`). `pyrodigal-gv v0.1.0` and `pyrodigal v3.0.0` were used respectively.
+
+| 673 crAss-like genomes | `pharokka` v1.5.0 `-g prodigal-gv` | `pharokka` v1.5.0 `-g prodigal` |
+|------------------------|------------------------------------|----------------------------------|
+| Total CDS | **81730** | 91999 |
+| Annotated Function CDS | **20344** | 17458 |
+| Unknown Function CDS | 61386 | 74541 |
+| Contigs with genetic code 15 | 229 | NA |
+| Contigs with genetic code 4 | 38 | NA |
+| Contigs with genetic code 11 | 406 | 673 |
+
+Fewer (larger) CDS were predicted more accurately, leading to an increase in the number of coding sequences with annotated functions. Approximately 40% of contigs in this dataset were predicted to use non-standard genetic codes according to `pyrodigal-gv`.
+
+# Benchmarking v1.4.0
+
+`pharokka` v1.4.0 has also been run on phage SAOMS1 and also the same 673 crAss phage dataset to showcase:
+
+1. The improved sensitivity of gene annotation with PyHMMER and a demonstration of how `--fast` is slower for metagenomes.
+ * If you can deal with the compute cost (especially for large metagenomes), I highly recommend `--fast` or `--meta_hmm` for metagenomes given how much more sensitive HMM search is.
+2. The large speed-up over v1.3.2 with `--fast` for phage isolates - with the proviso that no virulence factors or AMR genes will be detected.
+3. The slight speed-up over v1.3.2 with `--mmseqs2_only`.
+
+All benchmarking was conducted on a Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 16 threads (`-t 16`).
+
+SAOMS1 was run with Phanotate
+
+| Phage SAOMS1 | `pharokka` v1.4.0 `--fast` | `pharokka` v1.4.0 | `pharokka` v1.3.2 |
+|------------------------|-----------------------------|-------------------|-----------------|
+| Time (min) | 0.70 | 3.73 | 5.08 |
+| CDS | 246 | 246 | 246 |
+| Annotated Function CDS | 93 | 93 | 92 |
+| Unknown Function CDS | 153 | 153 | 154 |
+
+The 673 crAss-like genomes were run with `-m` (defaults to `--mmseqs2_only` in v 1.4.0) and with `-g prodigal` (pyrodigal v2.1.0).
+
+| 673 crAss-like genomes | `pharokka` v1.4.0 `--fast` | `pharokka` v1.4.0 `--mmseqs2_only` | `pharokka` v1.3.2 |
+|------------------------|---------------------------|----------------------------------|-----------------|
+| Time (min) | 35.62 | 11.05 | 13.27 |
+| CDS | 91999 | 91999 | 91999 |
+| Annotated Function CDS | **16713** | 9150 | 9150 |
+| Unknown Function CDS | 75286 | 82849 | 82849 |
+
+
# Original Benchmarking (v1.1.0)
`pharokka` (v1.1.0) has been benchmarked on an Intel Xeon CPU E5-4610 v2 @ 2.30 specifying 16 threads. Below is benchamarking comparing `pharokka` run with PHANOTATE and Prodigal against Prokka v1.14.6 run with PHROGs HMM profiles, as modified by Andrew Millard (https://millardlab.org/2021/11/21/phage-annotation-with-phrogs/).
@@ -372,35 +436,6 @@ For the crAss-like phage genomes, `pharokka` meta mode `-m` was enabled.
If you require fast annotations of extremely large datasets (i.e. thousands of input contigs), running `pharokka` with Prodigal (`-g prodigal`) is recommended.
-# Benchmarking v1.4.0
-
-`pharokka` v1.4.0 has also been run on phage SAOMS1 and also the same 673 crAss phage dataset to showcase:
-
-1. The improved sensitivity of gene annotation with PyHMMER and a demonstration of how `--fast` is slower for metagenomes.
- * If you can deal with the compute cost (especially for large metagenomes), I highly recommend `--fast` or `--meta_hmm` for metagenomes given how much more sensitive HMM search is.
-2. The large speed-up over v1.3.2 with `--fast` for phage isolates - with the proviso that no virulence factors or AMR genes will be detected.
-3. The slight speed-up over v1.3.2 with `--mmseqs2_only`.
-
-All benchmarking was conducted on a Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 16 threads (`-t 16`).
-
-SAOMS1 was run with Phanotate
-
-| Phage SAOMS1 | `pharokka` v1.4.0 `--fast` | `pharokka` v1.4.0 | `pharokka` v1.3.2 |
-|------------------------|-----------------------------|-------------------|-----------------|
-| Time (min) | 0.70 | 3.73 | 5.08 |
-| CDS | 246 | 246 | 246 |
-| Annotated Function CDS | 93 | 93 | 92 |
-| Unknown Function CDS | 153 | 153 | 154 |
-
-The 673 crAss-like genomes were run with `-m` (defaults to `--mmseqs2_only` in v 1.4.0) and with `-g prodigal` (pyrodigal v2.1.0).
-
-| 673 crAss-like genomes | `pharokka` v1.4.0 `--fast` | `pharokka` v1.4.0 `--mmseqs2_only` | `pharokka` v1.3.2 |
-|------------------------|---------------------------|----------------------------------|-----------------|
-| Time (min) | 35.62 | 11.05 | 13.27 |
-| CDS | 91999 | 91999 | 91999 |
-| Annotated Function CDS | **16713** | 9150 | 9150 |
-| Unknown Function CDS | 75286 | 82849 | 82849 |
-
# Bugs and Suggestions
@@ -414,7 +449,7 @@ If you use `pharokka`, I would recommend a citation in your manuscript along the
* All phages were annotated with Pharokka v ___ (Bouras, et al. 2023). Specifically, coding sequences (CDS) were predicted with PHANOTATE (McNair, et al. 2019), tRNAs were predicted with tRNAscan-SE 2.0 (Chan, et al. 2021), tmRNAs were predicted with Aragorn (Laslett, et al. 2004) and CRISPRs were preducted with CRT (Bland, et al. 2007). Functional annotation was generated by matching each CDS to the PHROGs (Terzian, et al. 2021), VFDB (Chen, et al. 2005) and CARD (Alcock, et al. 2020) databases using MMseqs2 (Steinegger, et al. 2017) and PyHMMER (Larralde, et al. 2023). Contigs were matched to their closest hit in the INPHARED database (Cook, et al. 2021) using mash (Ondov, et al. 2016). Plots were created with pyCirclize (Shimoyama 2022).
-With the following full citations for the constituent tools below:
+With the following full citations for the constituent tools below where relevant:
* Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007.
* McNair K., Zhou C., Dinsdale E.A., Souza B., Edwards R.A. (2019) "PHANOTATE: a novel approach to gene identification in phage genomes", Bioinformatics, https://doi.org/10.1093/bioinformatics/btz26.
@@ -428,4 +463,5 @@ With the following full citations for the constituent tools below:
* Alcock et al, "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic Acids Research (2020) https://doi.org/10.1093/nar/gkz935.
* Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296. doi:10.21105/joss.04296.
* Larralde M., Zeller G., (2023). PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214, https://doi.org/10.1093/bioinformatics/btad214.
-* Shimoyama, Y. (2022). pyCirclize: Circular visualization in Python [Computer software]. https://github.com/moshi4/pyCirclize
\ No newline at end of file
+* Larradle M. and Camargo A., (2023) Pyrodigal-gv: A Pyrodigal extension to predict genes in giant viruses and viruses with alternative genetic code. https://github.com/althonos/pyrodigal-gv.
+* Shimoyama, Y. (2022). pyCirclize: Circular visualization in Python [Computer software]. https://github.com/moshi4/pyCirclize.
\ No newline at end of file
diff --git a/bin/input_commands.py b/bin/input_commands.py
index e836e0e..cd7cbe0 100644
--- a/bin/input_commands.py
+++ b/bin/input_commands.py
@@ -4,9 +4,10 @@
import subprocess as sp
from argparse import RawTextHelpFormatter
+import pyrodigal
+import pyrodigal_gv
from Bio import SeqIO
from loguru import logger
-from pyrodigal import __version__
from util import get_version
@@ -60,7 +61,7 @@ def get_input():
"-g",
"--gene_predictor",
action="store",
- help='User specified gene predictor. Use "-g phanotate" or "-g prodigal". \nDefaults to phanotate (not required unless prodigal is desired).',
+ help='User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv" or "-g genbank". \nDefaults to phanotate (not required unless prodigal is desired).',
default="phanotate",
)
parser.add_argument(
@@ -78,7 +79,7 @@ def get_input():
parser.add_argument(
"-c",
"--coding_table",
- help="translation table for prodigal. Defaults to 11. Experimental only.",
+ help="translation table for prodigal. Defaults to 11.",
action="store",
default="11",
)
@@ -138,6 +139,16 @@ def get_input():
action="store",
default="nothing",
)
+ parser.add_argument(
+ "--skip_extra_annotations",
+ help="Skips tRNAscan-SE 2, MinCED and Aragorn.",
+ action="store_true",
+ ),
+ parser.add_argument(
+ "--skip_mash",
+ help="Skips running mash to find the closest match for each contig in INPHARED.",
+ action="store_true",
+ )
parser.add_argument(
"-V",
"--version",
@@ -197,23 +208,56 @@ def instantiate_dirs(output_dir, meta, force):
def validate_fasta(filename):
- if os.path.isfile(filename) == False: # if file doesnt exist
- logger.error(f"Error: Input file {filename} does not exist. Please check your input.")
+ if os.path.isfile(filename) == False: # if file doesnt exist
+ logger.error(
+ f"Error: Input file {filename} does not exist. Please check your input."
+ )
else:
with open(filename, "r") as handle:
fasta = SeqIO.parse(handle, "fasta")
- logger.info("Checking Input FASTA.")
+ logger.info(f"Checking input {filename}.")
if any(fasta):
- logger.info("FASTA checked.")
+ logger.info(f"Input {filename} is in FASTA format.")
else:
logger.error("Error: Input file is not in the FASTA format.")
+ # check for duplicate headers
+ logger.info(f"Checking input {filename} for duplicate FASTA headers.")
+ check_duplicate_headers(filename)
+ logger.info(f"All headers in {filename} are unique.")
+
+
+def check_duplicate_headers(fasta_file):
+ """
+ checks if there are duplicated in the FASTA header
+ in response to Tina's issue
+ https://github.com/gbouras13/pharokka/issues/293
+ """
+ header_set = set()
+
+ # Iterate through the FASTA file and check for duplicate headers
+ for record in SeqIO.parse(fasta_file, "fasta"):
+ header = record.description
+ if header in header_set:
+ logger.error(
+ f"Duplicate header found: {header}"
+ ) # errors if duplicate header found
+ else:
+ header_set.add(header)
+ # if it finished it will be fine
+
def validate_gene_predictor(gene_predictor, genbank_flag):
if gene_predictor == "phanotate":
logger.info("Phanotate will be used for gene prediction.")
elif gene_predictor == "prodigal":
- logger.info("Prodigal will be used for gene prediction.")
+ logger.info(
+ "Prodigal implemented with pyrodigal will be used for gene prediction."
+ )
+ elif gene_predictor == "prodigal-gv":
+ logger.info(
+ "Prodigal-gv implemented with pyrodigal-gv will be used for gene prediction."
+ )
elif gene_predictor == "genbank":
if genbank_flag is False:
logger.error(
@@ -221,7 +265,7 @@ def validate_gene_predictor(gene_predictor, genbank_flag):
)
else:
logger.error(
- "Error: gene predictor was incorrectly specified. Please use 'phanotate' or 'prodigal'."
+ "Error: gene predictor was incorrectly specified. Please use 'phanotate', 'prodigal' or 'prodigal-gv'."
)
@@ -300,8 +344,9 @@ def validate_threads(threads):
#######
-def check_dependencies():
+def check_dependencies(skip_mash):
"""Checks the dependencies and versions
+ skip_mash flag from args, won't check mash is skip mash specified
:return:
"""
#############
@@ -314,10 +359,10 @@ def check_dependencies():
except:
logger.error("Phanotate not found. Please reinstall pharokka.")
phan_out, _ = process.communicate()
- phanotate_out = phan_out.decode().strip()
- phanotate_major_version = int(phanotate_out.split(".")[0])
- phanotate_minor_version = int(phanotate_out.split(".")[1])
- phanotate_minorest_version = phanotate_out.split(".")[2]
+ phanotate_version = phan_out.decode().strip()
+ phanotate_major_version = int(phanotate_version.split(".")[0])
+ phanotate_minor_version = int(phanotate_version.split(".")[1])
+ phanotate_minorest_version = phanotate_version.split(".")[2]
logger.info(
f"Phanotate version found is v{phanotate_major_version}.{phanotate_minor_version}.{phanotate_minorest_version}"
@@ -471,25 +516,26 @@ def check_dependencies():
#############
# mash
#############
- try:
- process = sp.Popen(["mash", "--version"], stdout=sp.PIPE, stderr=sp.STDOUT)
- except:
- logger.error("mash not found. Please reinstall pharokka.")
+ if skip_mash is False:
+ try:
+ process = sp.Popen(["mash", "--version"], stdout=sp.PIPE, stderr=sp.STDOUT)
+ except:
+ logger.error("mash not found. Please reinstall pharokka.")
- mash_out, _ = process.communicate()
- mash_out = mash_out.decode().strip()
+ mash_out, _ = process.communicate()
+ mash_out = mash_out.decode().strip()
- mash_major_version = int(mash_out.split(".")[0])
- mash_minor_version = int(mash_out.split(".")[1])
+ mash_major_version = int(mash_out.split(".")[0])
+ mash_minor_version = int(mash_out.split(".")[1])
- logger.info(f"mash version found is v{mash_major_version}.{mash_minor_version}")
+ logger.info(f"mash version found is v{mash_major_version}.{mash_minor_version}")
- if mash_major_version != 2:
- logger.error("mash is the wrong version. Please re-install pharokka.")
- if mash_minor_version < 2:
- logger.error("mash is the wrong version. Please re-install pharokka.")
+ if mash_major_version != 2:
+ logger.error("mash is the wrong version. Please re-install pharokka.")
+ if mash_minor_version < 2:
+ logger.error("mash is the wrong version. Please re-install pharokka.")
- logger.info("mash version is ok.")
+ logger.info("mash version is ok.")
#############
# dnaapler
@@ -521,14 +567,37 @@ def check_dependencies():
# pyrodigal
#######
- pyrodigal_major_version = int(__version__.split(".")[0])
+ pyrodigal_version = pyrodigal.__version__
+ pyrodigal_major_version = int(pyrodigal_version.split(".")[0])
- if pyrodigal_major_version < 2:
+ if pyrodigal_major_version < 3:
logger.error("Pyrodigal is the wrong version. Please re-install pharokka.")
- logger.info(f"Pyrodigal version is v{__version__}")
+ logger.info(f"Pyrodigal version is v{pyrodigal_version}")
logger.info(f"Pyrodigal version is ok.")
+ #######
+ # pyrodigal gv
+ #######
+
+ pyrodigal_gv_version = pyrodigal_gv.__version__
+ pyrodigal_major_version = int(pyrodigal_gv_version.split(".")[0])
+
+ if pyrodigal_major_version < 0:
+ logger.error("Pyrodigal_gv is the wrong version. Please re-install pharokka.")
+
+ logger.info(f"Pyrodigal_gv version is v{pyrodigal_gv_version}")
+ logger.info(f"Pyrodigal_gv version is ok.")
+
+ return (
+ phanotate_version,
+ pyrodigal_version,
+ pyrodigal_gv_version,
+ trna_version,
+ aragorn_version,
+ minced_version,
+ )
+
def instantiate_split_output(out_dir, split):
"""
diff --git a/bin/pharokka.py b/bin/pharokka.py
index 699a235..6a843b4 100755
--- a/bin/pharokka.py
+++ b/bin/pharokka.py
@@ -22,8 +22,8 @@
run_dnaapler, run_mash_dist, run_mash_sketch,
run_minced, run_mmseqs, run_phanotate,
run_phanotate_fasta_meta, run_phanotate_txt_meta,
- run_pyrodigal, run_trna_scan, run_trnascan_meta,
- split_input_fasta, translate_fastas)
+ run_pyrodigal, run_pyrodigal_gv, run_trna_scan,
+ run_trnascan_meta, split_input_fasta, translate_fastas)
from util import count_contigs, get_version
@@ -106,7 +106,15 @@ def main():
# dependencies
logger.info("Checking dependencies.")
- check_dependencies()
+ # output versions
+ (
+ phanotate_version,
+ pyrodigal_version,
+ pyrodigal_gv_version,
+ trna_version,
+ aragorn_version,
+ minced_version,
+ ) = check_dependencies(args.skip_mash)
# instantiation/checking fasta and gene_predictor
if args.genbank is True:
@@ -275,24 +283,30 @@ def main():
run_phanotate(input_fasta, out_dir, logdir)
elif gene_predictor == "prodigal":
logger.info("Implementing Prodigal using Pyrodigal.")
- run_pyrodigal(input_fasta, out_dir, args.meta, args.coding_table)
+ run_pyrodigal(
+ input_fasta, out_dir, args.meta, args.coding_table, int(args.threads)
+ )
elif gene_predictor == "genbank":
logger.info("Extracting CDS information from your genbank file.")
+ elif gene_predictor == "prodigal-gv":
+ logger.info("Implementing Prodigal-gv using Pyrodigal-gv.")
+ run_pyrodigal_gv(input_fasta, out_dir, int(args.threads))
# translate fastas (parse genbank)
translate_fastas(out_dir, gene_predictor, args.coding_table, args.infile)
# run trna-scan meta mode if required
- if args.meta == True:
- logger.info("Starting tRNA-scanSE. Applying meta mode.")
- run_trnascan_meta(input_fasta, out_dir, args.threads, num_fastas)
- concat_trnascan_meta(out_dir, num_fastas)
- else:
- logger.info("Starting tRNA-scanSE.")
- run_trna_scan(input_fasta, args.threads, out_dir, logdir)
- # run minced and aragorn
- run_minced(input_fasta, out_dir, prefix, logdir)
- run_aragorn(input_fasta, out_dir, prefix, logdir)
+ if args.skip_extra_annotations is False:
+ if args.meta == True:
+ logger.info("Starting tRNA-scanSE. Applying meta mode.")
+ run_trnascan_meta(input_fasta, out_dir, args.threads, num_fastas)
+ concat_trnascan_meta(out_dir, num_fastas)
+ else:
+ logger.info("Starting tRNA-scanSE.")
+ run_trna_scan(input_fasta, args.threads, out_dir, logdir)
+ # run minced and aragorn
+ run_minced(input_fasta, out_dir, prefix, logdir)
+ run_aragorn(input_fasta, out_dir, prefix, logdir)
# running mmseqs2 on the 3 databases
if mmseqs_flag is True:
@@ -358,23 +372,36 @@ def main():
pharok.mmseqs_flag = mmseqs_flag
pharok.hmm_flag = hmm_flag
pharok.custom_hmm_flag = custom_hmm_flag
+ pharok.phanotate_version = phanotate_version
+ pharok.pyrodigal_version = pyrodigal_version
+ pharok.pyrodigal_gv_version = pyrodigal_gv_version
+ pharok.trna_version = trna_version
+ pharok.aragorn_version = aragorn_version
+ pharok.minced_version = minced_version
+ pharok.skip_extra_annotations = args.skip_extra_annotations
+
if pharok.hmm_flag is True:
pharok.pyhmmer_results_dict = best_results_pyhmmer
if pharok.custom_hmm_flag is True:
pharok.custom_pyhmmer_results_dict = best_results_custom_pyhmmer
+ #####################################
+ # post processing
+ #####################################
+
+ # gets df of length and gc for each contig
+ pharok.get_contig_name_lengths()
+
# post process results
# includes vfdb and card
# adds the merged df, vfdb and card top hits dfs to the class objec
# no need to specify params as they are in the class :)
pharok.process_results()
- # gets df of length and gc for each contig
- pharok.get_contig_name_lengths()
-
# parse the aragorn output
- # get flag whether there is a tmrna from aragor
- pharok.parse_aragorn()
+ # only if not skipping annots
+ if args.skip_extra_annotations is False:
+ pharok.parse_aragorn()
# create gff and save locustag to class for table
pharok.create_gff()
@@ -403,7 +430,7 @@ def main():
# convert to genbank
logger.info("Converting gff to genbank.")
# not part of the class so from processes.py
- convert_gff_to_gbk(input_fasta, out_dir, out_dir, prefix, args.coding_table)
+ convert_gff_to_gbk(input_fasta, out_dir, out_dir, prefix, pharok.prot_seq_df)
# update fasta headers and final output tsv
pharok.update_fasta_headers()
@@ -413,12 +440,16 @@ def main():
pharok.extract_terl()
# run mash
- logger.info("Finding the closest match for each contig in INPHARED using mash.")
- # in process.py
- run_mash_sketch(input_fasta, out_dir, logdir)
- run_mash_dist(out_dir, db_dir, logdir)
- # part of the class
- pharok.inphared_top_hits()
+ if args.skip_mash is False: # skips mash
+ logger.info("Finding the closest match for each contig in INPHARED using mash.")
+ # in process.py
+ run_mash_sketch(input_fasta, out_dir, logdir)
+ run_mash_dist(out_dir, db_dir, logdir)
+ # part of the class
+ pharok.inphared_top_hits()
+ else:
+ logger.info("You have chosen --skip_mash.")
+ logger.info("Skipping finding the closest match for each contig in INPHARED using mash.")
# delete tmp files
remove_post_processing_files(out_dir, gene_predictor, args.meta)
diff --git a/bin/post_processing.py b/bin/post_processing.py
index db5a096..ccdb1b8 100644
--- a/bin/post_processing.py
+++ b/bin/post_processing.py
@@ -51,6 +51,9 @@ def __init__(
),
gff_df: pd.DataFrame() = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]}),
locus_df: pd.DataFrame() = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]}),
+ prot_seq_df: pd.DataFrame() = pd.DataFrame(
+ {"col1": [1, 2, 3], "col2": [4, 5, 6]}
+ ),
tmrna_flag: bool = False,
trna_empty: bool = False,
crispr_count: int = 0,
@@ -58,6 +61,13 @@ def __init__(
mmseqs_flag: bool = True,
hmm_flag: bool = True,
custom_hmm_flag: bool = False,
+ phanotate_version: str = "1.5.0",
+ pyrodigal_version: str = "3.0.0",
+ pyrodigal_gv_version: str = "0.1.0",
+ trna_version: str = "2.0.12",
+ aragorn_version: str = "1.2.41",
+ minced_version: str = "0.4.2",
+ skip_extra_annotations: bool = False
) -> None:
"""
Parameters
@@ -104,6 +114,22 @@ def __init__(
whether HMM was run
custom_hmm_flag: bool
whether a custom db of HMMs was run
+ phanotate_version: str
+ phanotate_version from check_dependencies()
+ prodigal_version: str
+ prodigal_version from check_dependencies()
+ pyrodigal_gv_version: str
+ pyrodigal_gv_version from check dependencies()
+ trna_version: str
+ trnascan_version from check_dependencies()
+ aragorn_version: str
+ aragorn_version from check_dependencies()
+ minced_version: str
+ minced_version from check dependencies()
+ prot_seq_df: pd.DataFrame,
+ dataframe with protein sequence information for each egene
+ skip_extra_annotations: bool
+ boolean whether extra annotations are skipped
"""
self.out_dir = out_dir
self.db_dir = db_dir
@@ -127,6 +153,14 @@ def __init__(
self.mmseqs_flag = mmseqs_flag
self.hmm_flag = hmm_flag
self.custom_hmm_flag = custom_hmm_flag
+ self.phanotate_version = phanotate_version
+ self.pyrodigal_version = pyrodigal_version
+ self.pyrodigal_gv_version = pyrodigal_gv_version
+ self.trna_version = trna_version
+ self.aragorn_version = aragorn_version
+ self.minced_version = minced_version
+ self.prot_seq_df = prot_seq_df
+ self.skip_extra_annotations = skip_extra_annotations
def process_results(self):
"""
@@ -143,6 +177,34 @@ def process_results(self):
cds_file = os.path.join(self.out_dir, "cleaned_" + self.gene_predictor + ".tsv")
cds_df = pd.read_csv(cds_file, sep="\t", index_col=False)
+ ###########################################
+ # add the sequence to the df for the genbank conversion later on
+ fasta_input_aas_tmp = os.path.join(
+ self.out_dir, f"{self.gene_predictor}_aas_tmp.fasta"
+ )
+ prot_dict = SeqIO.to_dict(SeqIO.parse(fasta_input_aas_tmp, "fasta"))
+
+ # make a copy of cds_df
+ self.prot_seq_df = cds_df.copy()
+
+ # to match the output for gff
+ self.prot_seq_df[["gene", "st"]] = self.prot_seq_df["gene"].str.split(
+ " ", expand=True
+ )
+
+ self.prot_seq_df = self.prot_seq_df.drop(columns=["st"])
+
+ # get sequences for each gene in df
+ self.prot_seq_df["sequence"] = "MA"
+
+ for index, row in self.prot_seq_df.iterrows():
+ # get the gene id
+ gene = row["gene"]
+ if gene in prot_dict.keys():
+ # add the AA sequence
+ self.prot_seq_df.at[index, "sequence"] = prot_dict[gene].seq
+
+ ##########################################
# create the tophits_df and write it to file
if self.mmseqs_flag is True:
tophits_df = create_mmseqs_tophits(self.out_dir)
@@ -248,11 +310,13 @@ def process_results(self):
# add columns
if self.gene_predictor == "phanotate":
- merged_df["Method"] = "PHANOTATE"
+ merged_df["Method"] = f"PHANOTATE_{self.phanotate_version}"
elif self.gene_predictor == "prodigal":
- merged_df["Method"] = "PRODIGAL"
+ merged_df["Method"] = f"Pyrodigal_{self.pyrodigal_version}"
elif self.gene_predictor == "genbank":
merged_df["Method"] = "CUSTOM"
+ elif self.gene_predictor == "prodigal-gv":
+ merged_df["Method"] = f"Pyrodigal-gv_{self.pyrodigal_gv_version}"
merged_df["Region"] = "CDS"
# # replace with No_PHROG if nothing found
@@ -301,23 +365,91 @@ def process_results(self):
def get_contig_name_lengths(self):
"""
- Gets contig name and length in the input fasta file and calculates gc
+ Gets contig name and length in the input fasta file and calculates gc.
+ Also adds translation table
:param fasta_input: input fasta file
:return: length_df a pandas dataframe (to class)
"""
+
fasta_sequences = SeqIO.parse(open(self.input_fasta), "fasta")
+
+ if self.gene_predictor == "prodigal-gv":
+ # define col list
+ col_list = [
+ "contig",
+ "Method",
+ "Region",
+ "start",
+ "stop",
+ "score",
+ "frame",
+ "phase",
+ "attributes",
+ ]
+ # read gff (no fasta output)
+ pyrodigal_gv_gff = pd.read_csv(
+ os.path.join(self.out_dir, "prodigal-gv_out.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ )
+
+ pyrodigal_gv_gff[["attributes", "transl_table"]] = pyrodigal_gv_gff[
+ "attributes"
+ ].str.split("transl_table=", expand=True)
+ pyrodigal_gv_gff[["transl_table", "rest"]] = pyrodigal_gv_gff[
+ "transl_table"
+ ].str.split(";conf", expand=True)
+ # drop and then remove duplicates in df
+ pyrodigal_gv_gff = pyrodigal_gv_gff.drop(
+ columns=[
+ "rest",
+ "Method",
+ "Region",
+ "start",
+ "stop",
+ "score",
+ "frame",
+ "phase",
+ "attributes",
+ ]
+ )
+ # Remove duplicate rows based on all columns
+ pyrodigal_gv_gff = pyrodigal_gv_gff.drop_duplicates()
+ # Convert to a dictionary
+ transl_table_dict = pyrodigal_gv_gff.set_index("contig")[
+ "transl_table"
+ ].to_dict()
+
contig_names = []
lengths = []
gc = []
- for fasta in fasta_sequences:
- contig_names.append(fasta.id)
- lengths.append(len(fasta.seq))
- gc.append(round(GC(fasta.seq), 2))
+ transl_tables = []
+
+ transl_table = "11"
+ if self.gene_predictor == "phanotate":
+ transl_table = "11"
+ elif self.gene_predictor == "prodigal":
+ transl_table = self.coding_table
+ elif self.gene_predictor == "genbank":
+ transl_table = "custom_gene_calls_from_genbank"
+
+ for record in fasta_sequences:
+ contig_names.append(record.id)
+ lengths.append(len(record.seq))
+ gc.append(round(GC(record.seq), 2))
+ # pyrodigal-gv lookup from teh dict
+ if self.gene_predictor == "prodigal-gv":
+ transl_table = transl_table_dict[record.id]
+
+ transl_tables.append(transl_table)
+
length_df = pd.DataFrame(
{
"contig": contig_names,
"length": lengths,
"gc_perc": gc,
+ "transl_table": transl_tables,
}
)
self.length_df = length_df
@@ -370,7 +502,7 @@ def parse_aragorn(self):
split = line.split()
start_stops = split[2].replace("[", "").replace("]", "").split(",")
contig = self.length_df["contig"][0]
- method = "Aragorn"
+ method = f"Aragorn_{self.aragorn_version}"
region = "tmRNA"
start = start_stops[0].replace(
"c", ""
@@ -430,7 +562,7 @@ def parse_aragorn(self):
split[2].replace("[", "").replace("]", "").split(",")
)
contig = self.length_df["contig"][j]
- method = "Aragorn"
+ method = f"Aragorn_{self.aragorn_version}"
region = "tmRNA"
start = start_stops[0].replace(
"c", ""
@@ -516,6 +648,10 @@ def create_gff(self):
# get all contigs
contigs = self.length_df["contig"].astype("string")
+ # add the translation table
+ transl_table_df = self.length_df.drop(columns=["length", "gc_perc"])
+ self.merged_df = self.merged_df.merge(transl_table_df, how="left", on="contig")
+
############ locus tag #########
# write df for locus tag parsing
# zfill - makes the CDS 4 digits trailing zeroes for vcontact
@@ -550,6 +686,7 @@ def create_gff(self):
# assign count and locus_tag to merged_df (for meta)
self.merged_df["locus_tag"] = locus_df["locus_tag"]
self.merged_df["count"] = locus_df["count"]
+
#################################
#########
@@ -573,6 +710,9 @@ def create_gff(self):
"ID="
+ locus_df["locus_tag"].astype(str)
+ ";"
+ + "transl_table="
+ + locus_df["transl_table"].astype(str)
+ + ";"
+ "phrog="
+ self.merged_df["phrog"].astype(str)
+ ";"
@@ -650,215 +790,231 @@ def create_gff(self):
### trnas
# check if no trnas
- col_list = [
- "contig",
- "Method",
- "Region",
- "start",
- "stop",
- "score",
- "frame",
- "phase",
- "attributes",
- ]
- trna_empty = is_trna_empty(self.out_dir)
- if trna_empty == False:
- trna_df = pd.read_csv(
- os.path.join(self.out_dir, "trnascan_out.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- )
- # index hack if meta mode
- if self.meta_mode == True:
- subset_dfs = []
- for contig in contigs:
- subset_df = trna_df[trna_df["contig"] == contig].reset_index()
- # keep only trnas before indexing
- subset_df = subset_df[
- (subset_df["Region"] == "tRNA")
- | (subset_df["Region"] == "pseudogene")
+
+ # to make sure you aren't skipping trnas
+ if self.skip_extra_annotations is False:
+ col_list = [
+ "contig",
+ "Method",
+ "Region",
+ "start",
+ "stop",
+ "score",
+ "frame",
+ "phase",
+ "attributes",
+ ]
+ trna_empty = is_trna_empty(self.out_dir)
+ if trna_empty == False:
+ trna_df = pd.read_csv(
+ os.path.join(self.out_dir, "trnascan_out.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ )
+
+ # convert the method to update with version
+ trna_df["Method"] = f"tRNAscan-SE_{self.trna_version}"
+
+ # index hack if meta mode
+ if self.meta_mode == True:
+ subset_dfs = []
+ for contig in contigs:
+ subset_df = trna_df[trna_df["contig"] == contig].reset_index()
+ # keep only trnas before indexing
+ subset_df = subset_df[
+ (subset_df["Region"] == "tRNA")
+ | (subset_df["Region"] == "pseudogene")
+ ]
+ subset_df = subset_df.reset_index(drop=True)
+ subset_df["count"] = subset_df.index
+ # so not 0 indexed
+ subset_df["count"] = subset_df["count"] + 1
+ # z fill to make the locus tag 4
+ subset_df["locus_tag"] = (
+ contig + "_tRNA_" + subset_df["count"].astype(str).str.zfill(4)
+ )
+ subset_df = subset_df.drop(columns=["count"])
+ subset_dfs.append(subset_df)
+ trna_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
+ trna_df = trna_df.drop(columns=["index"])
+ else:
+ # keep only trnas
+ trna_df = trna_df[
+ (trna_df["Region"] == "tRNA") | (trna_df["Region"] == "pseudogene")
]
- subset_df = subset_df.reset_index(drop=True)
- subset_df["count"] = subset_df.index
- # so not 0 indexed
- subset_df["count"] = subset_df["count"] + 1
- # z fill to make the locus tag 4
- subset_df["locus_tag"] = (
- contig + "_tRNA_" + subset_df["count"].astype(str).str.zfill(4)
+ trna_df = trna_df.reset_index(drop=True)
+ trna_df["count"] = trna_df.index
+ trna_df["count"] = trna_df["count"] + 1
+ trna_df["locus_tag"] = (
+ self.locustag + "_tRNA_" + trna_df["count"].astype(str).str.zfill(4)
)
- subset_df = subset_df.drop(columns=["count"])
- subset_dfs.append(subset_df)
- trna_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
- trna_df = trna_df.drop(columns=["index"])
- else:
- # keep only trnas
- trna_df = trna_df[
- (trna_df["Region"] == "tRNA") | (trna_df["Region"] == "pseudogene")
- ]
- trna_df = trna_df.reset_index(drop=True)
- trna_df["count"] = trna_df.index
- trna_df["count"] = trna_df["count"] + 1
- trna_df["locus_tag"] = (
- self.locustag + "_tRNA_" + trna_df["count"].astype(str).str.zfill(4)
- )
- trna_df = trna_df.drop(columns=["count"])
+ trna_df = trna_df.drop(columns=["count"])
- trna_df.start = trna_df.start.astype(int)
- trna_df.stop = trna_df.stop.astype(int)
- trna_df[["attributes", "isotypes"]] = trna_df["attributes"].str.split(
- ";isotype=", expand=True
- )
- trna_df[["isotypes", "anticodon"]] = trna_df["isotypes"].str.split(
- ";anticodon=", expand=True
- )
- trna_df[["anticodon", "rest"]] = trna_df["anticodon"].str.split(
- ";gene_biotype", expand=True
- )
- trna_df["trna_product"] = (
- "tRNA-" + trna_df["isotypes"] + "(" + trna_df["anticodon"] + ")"
- )
- trna_df = trna_df.drop(columns=["attributes"])
- trna_df["attributes"] = (
- "ID="
- + trna_df["locus_tag"]
- + ";"
- + "trna="
- + trna_df["trna_product"].astype(str)
- + ";"
- + "isotype="
- + trna_df["isotypes"].astype(str)
- + ";"
- + "anticodon="
- + trna_df["anticodon"].astype(str)
- + ";"
- + "locus_tag="
- + trna_df["locus_tag"]
- )
- trna_df = trna_df.drop(
- columns=["isotypes", "anticodon", "rest", "trna_product", "locus_tag"]
- )
+ trna_df.start = trna_df.start.astype(int)
+ trna_df.stop = trna_df.stop.astype(int)
+ trna_df[["attributes", "isotypes"]] = trna_df["attributes"].str.split(
+ ";isotype=", expand=True
+ )
+ trna_df[["isotypes", "anticodon"]] = trna_df["isotypes"].str.split(
+ ";anticodon=", expand=True
+ )
+ trna_df[["anticodon", "rest"]] = trna_df["anticodon"].str.split(
+ ";gene_biotype", expand=True
+ )
+ trna_df["trna_product"] = (
+ "tRNA-" + trna_df["isotypes"] + "(" + trna_df["anticodon"] + ")"
+ )
+ trna_df = trna_df.drop(columns=["attributes"])
+ trna_df["attributes"] = (
+ "ID="
+ + trna_df["locus_tag"]
+ + ";"
+ + "transl_table="
+ + locus_df["transl_table"].astype(str)
+ + ";"
+ + "trna="
+ + trna_df["trna_product"].astype(str)
+ + ";"
+ + "isotype="
+ + trna_df["isotypes"].astype(str)
+ + ";"
+ + "anticodon="
+ + trna_df["anticodon"].astype(str)
+ + ";"
+ + "locus_tag="
+ + trna_df["locus_tag"]
+ )
+ trna_df = trna_df.drop(
+ columns=["isotypes", "anticodon", "rest", "trna_product", "locus_tag"]
+ )
- ### crisprs
- crispr_count = get_crispr_count(self.out_dir, self.prefix)
- # add to gff if > 0
- if crispr_count > 0:
- minced_df = pd.read_csv(
- os.path.join(self.out_dir, self.prefix + "_minced.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- comment="#",
- )
- minced_df.start = minced_df.start.astype(int)
- minced_df.stop = minced_df.stop.astype(int)
- minced_df[["attributes", "rpt_unit_seq"]] = minced_df[
- "attributes"
- ].str.split(";rpt_unit_seq=", expand=True)
- minced_df[["attributes", "rpt_family"]] = minced_df["attributes"].str.split(
- ";rpt_family=", expand=True
- )
- minced_df[["attributes", "rpt_type"]] = minced_df["attributes"].str.split(
- ";rpt_type=", expand=True
- )
- minced_df = minced_df.drop(columns=["attributes"])
- # index hack if meta mode
- subset_dfs = []
- if self.meta_mode == True:
- for contig in contigs:
- subset_df = minced_df[minced_df["contig"] == contig].reset_index()
- subset_df["count"] = subset_df.index
- # so not 0 indexed
- subset_df["count"] = subset_df["count"] + 1
- # z fill to make the locus tag 4
- subset_df["count"] = subset_df["count"].astype(str).str.zfill(4)
- subset_df["locus_tag"] = (
- contig
+ ### crisprs
+ crispr_count = get_crispr_count(self.out_dir, self.prefix)
+ # add to gff if > 0
+ if crispr_count > 0:
+ minced_df = pd.read_csv(
+ os.path.join(self.out_dir, self.prefix + "_minced.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ comment="#",
+ )
+ minced_df.start = minced_df.start.astype(int)
+ minced_df.stop = minced_df.stop.astype(int)
+ minced_df[["attributes", "rpt_unit_seq"]] = minced_df[
+ "attributes"
+ ].str.split(";rpt_unit_seq=", expand=True)
+ minced_df[["attributes", "rpt_family"]] = minced_df["attributes"].str.split(
+ ";rpt_family=", expand=True
+ )
+ minced_df[["attributes", "rpt_type"]] = minced_df["attributes"].str.split(
+ ";rpt_type=", expand=True
+ )
+ minced_df = minced_df.drop(columns=["attributes"])
+ # index hack if meta mode
+ subset_dfs = []
+ if self.meta_mode == True:
+ for contig in contigs:
+ subset_df = minced_df[minced_df["contig"] == contig].reset_index()
+ subset_df["count"] = subset_df.index
+ # so not 0 indexed
+ subset_df["count"] = subset_df["count"] + 1
+ # z fill to make the locus tag 4
+ subset_df["count"] = subset_df["count"].astype(str).str.zfill(4)
+ subset_df["locus_tag"] = (
+ contig
+ + "_CRISPR_"
+ + subset_df["count"].astype(str).str.zfill(4)
+ )
+ subset_df = subset_df.drop(columns=["count"])
+ subset_dfs.append(subset_df)
+ minced_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
+ minced_df = minced_df.drop(columns=["index"])
+ else:
+ minced_df["count"] = minced_df.index
+ minced_df["count"] = minced_df["count"] + 1
+ minced_df["locus_tag"] = (
+ self.locustag
+ "_CRISPR_"
- + subset_df["count"].astype(str).str.zfill(4)
+ + minced_df["count"].astype(str).str.zfill(4)
)
- subset_df = subset_df.drop(columns=["count"])
- subset_dfs.append(subset_df)
- minced_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
- minced_df = minced_df.drop(columns=["index"])
- else:
- minced_df["count"] = minced_df.index
- minced_df["count"] = minced_df["count"] + 1
- minced_df["locus_tag"] = (
- self.locustag
- + "_CRISPR_"
- + minced_df["count"].astype(str).str.zfill(4)
+ minced_df = minced_df.drop(columns=["count"])
+
+ minced_df["attributes"] = (
+ "ID="
+ + minced_df["locus_tag"]
+ + ";"
+ + "transl_table="
+ + locus_df["transl_table"].astype(str)
+ + ";"
+ + "rpt_type="
+ + minced_df["rpt_type"].astype(str)
+ + ";"
+ + "rpt_family="
+ + minced_df["rpt_family"].astype(str)
+ + ";"
+ + "rpt_unit_seq="
+ + minced_df["rpt_unit_seq"].astype(str)
+ + ";"
+ + "locus_tag="
+ + minced_df["locus_tag"]
+ )
+ minced_df = minced_df.drop(
+ columns=["rpt_unit_seq", "rpt_family", "rpt_type", "locus_tag"]
)
- minced_df = minced_df.drop(columns=["count"])
-
- minced_df["attributes"] = (
- "ID="
- + minced_df["locus_tag"]
- + ";"
- + "rpt_type="
- + minced_df["rpt_type"].astype(str)
- + ";"
- + "rpt_family="
- + minced_df["rpt_family"].astype(str)
- + ";"
- + "rpt_unit_seq="
- + minced_df["rpt_unit_seq"].astype(str)
- + ";"
- + "locus_tag="
- + minced_df["locus_tag"]
- )
- minced_df = minced_df.drop(
- columns=["rpt_unit_seq", "rpt_family", "rpt_type", "locus_tag"]
- )
- ### tmrna
- # add to gff there is a tmrna
- if self.tmrna_flag == True:
- tmrna_df = pd.read_csv(
- os.path.join(self.out_dir, self.prefix + "_aragorn.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- )
- tmrna_df.start = tmrna_df.start.astype(int)
- tmrna_df.stop = tmrna_df.stop.astype(int)
+ ### tmrna
+ # add to gff there is a tmrna
+ if self.tmrna_flag == True:
+ tmrna_df = pd.read_csv(
+ os.path.join(self.out_dir, self.prefix + "_aragorn.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ )
+ tmrna_df.start = tmrna_df.start.astype(int)
+ tmrna_df.stop = tmrna_df.stop.astype(int)
- # index hack if meta mode
- subset_dfs = []
- if self.meta_mode == True:
- for contig in contigs:
- subset_df = tmrna_df[tmrna_df["contig"] == contig].reset_index()
- subset_df["count"] = subset_df.index
- # so not 0 indexed
- subset_df["count"] = subset_df["count"] + 1
- # z fill to make the locus tag 4
- subset_df["count"] = subset_df["count"].astype(str).str.zfill(4)
- subset_df["locus_tag"] = (
- contig + "_tmRNA_" + subset_df["count"].astype(str).str.zfill(4)
+ # index hack if meta mode
+ subset_dfs = []
+ if self.meta_mode == True:
+ for contig in contigs:
+ subset_df = tmrna_df[tmrna_df["contig"] == contig].reset_index()
+ subset_df["count"] = subset_df.index
+ # so not 0 indexed
+ subset_df["count"] = subset_df["count"] + 1
+ # z fill to make the locus tag 4
+ subset_df["count"] = subset_df["count"].astype(str).str.zfill(4)
+ subset_df["locus_tag"] = (
+ contig + "_tmRNA_" + subset_df["count"].astype(str).str.zfill(4)
+ )
+ subset_df = subset_df.drop(columns=["count"])
+ subset_dfs.append(subset_df)
+ tmrna_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
+ tmrna_df = tmrna_df.drop(columns=["index"])
+ else:
+ tmrna_df["count"] = tmrna_df.index
+ tmrna_df["count"] = tmrna_df["count"] + 1
+ tmrna_df["locus_tag"] = (
+ self.locustag
+ + "_tmRNA_"
+ + tmrna_df["count"].astype(str).str.zfill(4)
)
- subset_df = subset_df.drop(columns=["count"])
- subset_dfs.append(subset_df)
- tmrna_df = pd.concat(subset_dfs, axis=0, ignore_index=True)
- tmrna_df = tmrna_df.drop(columns=["index"])
- else:
- tmrna_df["count"] = tmrna_df.index
- tmrna_df["count"] = tmrna_df["count"] + 1
- tmrna_df["locus_tag"] = (
- self.locustag
- + "_tmRNA_"
- + tmrna_df["count"].astype(str).str.zfill(4)
+ tmrna_df = tmrna_df.drop(columns=["count"])
+
+ tmrna_df["attributes"] = (
+ "ID="
+ + tmrna_df["locus_tag"]
+ + ";"
+ + "transl_table="
+ + locus_df["transl_table"].astype(str)
+ + ";"
+ + tmrna_df["attributes"].astype(str)
+ + ";locus_tag="
+ + tmrna_df["locus_tag"]
)
- tmrna_df = tmrna_df.drop(columns=["count"])
-
- tmrna_df["attributes"] = (
- "ID="
- + tmrna_df["locus_tag"]
- + ";"
- + tmrna_df["attributes"].astype(str)
- + ";locus_tag="
- + tmrna_df["locus_tag"]
- )
- tmrna_df = tmrna_df.drop(columns=["locus_tag"])
+ tmrna_df = tmrna_df.drop(columns=["locus_tag"])
# write header of final gff files
with open(os.path.join(self.out_dir, self.prefix + ".gff"), "w") as f:
@@ -874,23 +1030,27 @@ def create_gff(self):
# combine dfs depending on whether the elements were detected
- if trna_empty is True and self.tmrna_flag is False and crispr_count == 0: # all
+ # skip extra trna setc
+ if self.skip_extra_annotations is True:
df_list = [gff_df]
- elif trna_empty is False and self.tmrna_flag is False and crispr_count == 0:
- df_list = [gff_df, trna_df]
- elif trna_empty is True and self.tmrna_flag is True and crispr_count == 0:
- df_list = [gff_df, tmrna_df]
- elif trna_empty is True and self.tmrna_flag is False and crispr_count > 0:
- df_list = [gff_df, minced_df]
- elif trna_empty is False and self.tmrna_flag is True and crispr_count == 0:
- df_list = [gff_df, trna_df, tmrna_df]
- elif trna_empty is False and self.tmrna_flag is False and crispr_count > 0:
- df_list = [gff_df, trna_df, minced_df]
- elif trna_empty is True and self.tmrna_flag is True and crispr_count > 0:
- df_list = [gff_df, tmrna_df, minced_df]
- # if trna_empty is False and self.tmrna_flag is True and crispr_count > 0: # all detected
- else: # all detected
- df_list = [gff_df, trna_df, tmrna_df, minced_df]
+ else:
+ if trna_empty is True and self.tmrna_flag is False and crispr_count == 0: # all
+ df_list = [gff_df]
+ elif trna_empty is False and self.tmrna_flag is False and crispr_count == 0:
+ df_list = [gff_df, trna_df]
+ elif trna_empty is True and self.tmrna_flag is True and crispr_count == 0:
+ df_list = [gff_df, tmrna_df]
+ elif trna_empty is True and self.tmrna_flag is False and crispr_count > 0:
+ df_list = [gff_df, minced_df]
+ elif trna_empty is False and self.tmrna_flag is True and crispr_count == 0:
+ df_list = [gff_df, trna_df, tmrna_df]
+ elif trna_empty is False and self.tmrna_flag is False and crispr_count > 0:
+ df_list = [gff_df, trna_df, minced_df]
+ elif trna_empty is True and self.tmrna_flag is True and crispr_count > 0:
+ df_list = [gff_df, tmrna_df, minced_df]
+ # if trna_empty is False and self.tmrna_flag is True and crispr_count > 0: # all detected
+ else: # all detected
+ df_list = [gff_df, trna_df, tmrna_df, minced_df]
total_gff = pd.concat(df_list, ignore_index=True)
@@ -899,7 +1059,7 @@ def create_gff(self):
total_gff.stop = total_gff.stop.astype(int)
# sorts all features by start
- total_gff = total_gff.groupby(["contig"], sort=False, as_index=False).apply(
+ total_gff = total_gff.groupby(["contig"], sort=False, as_index=False, group_keys=True).apply(
pd.DataFrame.sort_values, "start", ascending=True
)
@@ -923,8 +1083,13 @@ def create_gff(self):
self.locus_df = locus_df
self.gff_df = gff_df
self.total_gff = total_gff
- self.trna_empty = trna_empty
- self.crispr_count = crispr_count
+ if self.skip_extra_annotations is False:
+ self.trna_empty = trna_empty
+ self.crispr_count = crispr_count
+ else: # skip annotations
+ self.trna_empty = True
+ self.crispr_count = 0
+ self.tmrna_flag = False
def create_tbl(
self,
@@ -944,10 +1109,20 @@ def create_tbl(
# get the cds
+ self.total_gff = self.total_gff.reset_index(drop=True)
+
if self.gene_predictor == "phanotate":
- cds_df = self.total_gff[self.total_gff["Method"] == "PHANOTATE"]
+ cds_df = self.total_gff[
+ self.total_gff["Method"] == f"PHANOTATE_{self.phanotate_version}"
+ ]
elif self.gene_predictor == "prodigal":
- cds_df = self.total_gff[self.total_gff["Method"] == "PRODIGAL"]
+ cds_df = self.total_gff[
+ self.total_gff["Method"] == f"Pyrodigal_{self.pyrodigal_version}"
+ ]
+ elif self.gene_predictor == "prodigal-gv":
+ cds_df = self.total_gff[
+ self.total_gff["Method"] == f"Pyrodigal-gv_{self.pyrodigal_gv_version}"
+ ]
elif self.gene_predictor == "genbank":
cds_df = self.total_gff[self.total_gff["Method"] == "CUSTOM"]
@@ -958,10 +1133,13 @@ def create_tbl(
";function=", expand=True
)
+
### trnas
# check if no trnas
- if self.trna_empty == False:
- trna_df = self.total_gff[self.total_gff["Method"] == "tRNAscan-SE"]
+ if self.trna_empty is False:
+ trna_df = self.total_gff[
+ self.total_gff["Method"] == f"tRNAscan-SE_{self.trna_version}"
+ ]
# keep only trnas and pseudogenes
trna_df.start = trna_df.start.astype(int)
trna_df.stop = trna_df.stop.astype(int)
@@ -991,7 +1169,7 @@ def create_tbl(
].str.split(";rpt_unit_seq=", expand=True)
### TMRNAs
- if self.tmrna_flag == True:
+ if self.tmrna_flag is True:
tmrna_df = self.total_gff[self.total_gff["Region"] == "tmRNA"]
tmrna_df.start = tmrna_df.start.astype(int)
tmrna_df.stop = tmrna_df.stop.astype(int)
@@ -1045,7 +1223,7 @@ def create_tbl(
+ "\t"
+ "transl_table"
+ "\t"
- + str(self.coding_table)
+ + str(subset_df["transl_table"])
+ "\n"
)
if self.trna_empty == False:
@@ -1090,7 +1268,7 @@ def create_tbl(
+ "\t"
+ "transl_table"
+ "\t"
- + str(self.coding_table)
+ + str(subset_df["transl_table"])
+ "\n"
)
if self.crispr_count > 0:
@@ -1135,7 +1313,7 @@ def create_tbl(
+ "\t"
+ "transl_table"
+ "\t"
- + str(self.coding_table)
+ + str(subset_df["transl_table"])
+ "\n"
)
if self.tmrna_flag == True:
@@ -1180,7 +1358,7 @@ def create_tbl(
+ "\t"
+ "transl_table"
+ "\t"
- + str(self.coding_table)
+ + str(subset_df["transl_table"])
+ "\n"
)
@@ -1245,7 +1423,7 @@ def convert_singles_gff_to_gbk(self):
)
contig = row["contig"]
convert_gff_to_gbk(
- fasta_file, single_gff_dir, single_gbk_dir, contig, self.coding_table
+ fasta_file, single_gff_dir, single_gbk_dir, contig, self.prot_seq_df
)
def split_fasta_singles(self):
@@ -1385,44 +1563,45 @@ def create_txt(self):
#### trna scan
# read in trnascan
- col_list = [
- "contig",
- "Method",
- "Region",
- "start",
- "stop",
- "score",
- "frame",
- "phase",
- "attributes",
- ]
- trna_df = pd.read_csv(
- os.path.join(self.out_dir, "trnascan_out.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- )
- # keep only trnas and pseudogenes
- trna_df = trna_df[
- (trna_df["Region"] == "tRNA") | (trna_df["Region"] == "pseudogene")
- ]
+ if self.skip_extra_annotations is False:
+ col_list = [
+ "contig",
+ "Method",
+ "Region",
+ "start",
+ "stop",
+ "score",
+ "frame",
+ "phase",
+ "attributes",
+ ]
+ trna_df = pd.read_csv(
+ os.path.join(self.out_dir, "trnascan_out.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ )
+ # keep only trnas and pseudogenes
+ trna_df = trna_df[
+ (trna_df["Region"] == "tRNA") | (trna_df["Region"] == "pseudogene")
+ ]
- #### crispr
- crispr_df = pd.read_csv(
- os.path.join(self.out_dir, self.prefix + "_minced.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- comment="#",
- )
+ #### crispr
+ crispr_df = pd.read_csv(
+ os.path.join(self.out_dir, self.prefix + "_minced.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ comment="#",
+ )
- #### tmrna
- tmrna_df = pd.read_csv(
- os.path.join(self.out_dir, self.prefix + "_aragorn.gff"),
- delimiter="\t",
- index_col=False,
- names=col_list,
- )
+ #### tmrna
+ tmrna_df = pd.read_csv(
+ os.path.join(self.out_dir, self.prefix + "_aragorn.gff"),
+ delimiter="\t",
+ index_col=False,
+ names=col_list,
+ )
# write descriptions for each contig
for contig in contigs:
@@ -1434,9 +1613,10 @@ def create_txt(self):
cds_count = len(
cds_mmseqs_merge_cont_df[cds_mmseqs_merge_cont_df["Region"] == "CDS"]
)
- trna_count = len(trna_df[trna_df["contig"] == contig])
- tmrna_count = len(tmrna_df[tmrna_df["contig"] == contig])
- crispr_count = len(crispr_df[crispr_df["contig"] == contig])
+ if self.skip_extra_annotations is False:
+ trna_count = len(trna_df[trna_df["contig"] == contig])
+ tmrna_count = len(tmrna_df[tmrna_df["contig"] == contig])
+ crispr_count = len(crispr_df[crispr_df["contig"] == contig])
if len(self.vfdb_results["contig"]) != 0:
vfdb_count = len(
self.vfdb_results[self.vfdb_results["contig"] == contig]
@@ -1590,20 +1770,23 @@ def create_txt(self):
}
)
- # add other features
- trna_row = pd.DataFrame(
- {"Description": ["tRNAs"], "Count": [trna_count], "contig": [contig]}
- )
- crispr_row = pd.DataFrame(
- {
- "Description": ["CRISPRs"],
- "Count": [crispr_count],
- "contig": [contig],
- }
- )
- tmrna_row = pd.DataFrame(
- {"Description": ["tmRNAs"], "Count": [tmrna_count], "contig": [contig]}
- )
+ # only if not skipped
+ if self.skip_extra_annotations is False:
+ # add other features
+ trna_row = pd.DataFrame(
+ {"Description": ["tRNAs"], "Count": [trna_count], "contig": [contig]}
+ )
+ crispr_row = pd.DataFrame(
+ {
+ "Description": ["CRISPRs"],
+ "Count": [crispr_count],
+ "contig": [contig],
+ }
+ )
+ tmrna_row = pd.DataFrame(
+ {"Description": ["tmRNAs"], "Count": [tmrna_count], "contig": [contig]}
+ )
+
vfdb_row = pd.DataFrame(
{
"Description": ["VFDB_Virulence_Factors"],
@@ -1626,9 +1809,10 @@ def create_txt(self):
] = cds_coding_density
# eappend it all to combo_list
combo_list.append(cds_df)
- combo_list.append(trna_row)
- combo_list.append(crispr_row)
- combo_list.append(tmrna_row)
+ if self.skip_extra_annotations is False:
+ combo_list.append(trna_row)
+ combo_list.append(crispr_row)
+ combo_list.append(tmrna_row)
combo_list.append(vfdb_row)
combo_list.append(CARD_row)
@@ -1917,7 +2101,7 @@ def create_mmseqs_tophits(out_dir):
##mmseqs
mmseqs_file = os.path.join(out_dir, "mmseqs_results.tsv")
- logger.info("Processing mmseqs2 outputs.")
+ logger.info("Processing MMseqs2 outputs.")
logger.info("Processing PHROGs output.")
col_list = [
"mmseqs_phrog",
@@ -1935,40 +2119,23 @@ def create_mmseqs_tophits(out_dir):
mmseqs_df = pd.read_csv(
mmseqs_file, delimiter="\t", index_col=False, names=col_list
)
- # get list of genes
- genes = mmseqs_df.gene.unique()
-
- # instantiate tophits list
- tophits = []
-
- for gene in genes:
- tmp_df = (
- mmseqs_df.loc[mmseqs_df["gene"] == gene]
- .sort_values("mmseqs_eVal")
- .reset_index(drop=True)
- .loc[0]
- )
- tophits.append(
- [
- tmp_df.mmseqs_phrog,
- tmp_df.gene,
- tmp_df.mmseqs_alnScore,
- tmp_df.mmseqs_seqIdentity,
- tmp_df.mmseqs_eVal,
- ]
- )
+ # optimise the tophits generation
+ # Group by 'gene' and find the top hit for each group
+ tophits_df = mmseqs_df.groupby('gene', group_keys=True).apply(lambda group: group.nsmallest(1, 'mmseqs_eVal')).reset_index(drop=True)
+
+
# create tophits df
- tophits_df = pd.DataFrame(
- tophits,
- columns=[
- "mmseqs_phrog",
- "gene",
- "mmseqs_alnScore",
- "mmseqs_seqIdentity",
- "mmseqs_eVal",
- ],
- )
+ tophits_df = tophits_df[[
+ "mmseqs_phrog",
+ "gene",
+ "mmseqs_alnScore",
+ "mmseqs_seqIdentity",
+ "mmseqs_eVal",
+ ]]
+
+
+
tophits_df.to_csv(
os.path.join(out_dir, "top_hits_mmseqs.tsv"), sep="\t", index=False
)
@@ -2010,6 +2177,10 @@ def remove_post_processing_files(out_dir, gene_predictor, meta):
remove_file(os.path.join(out_dir, "phanotate_out.txt"))
if gene_predictor == "prodigal":
remove_file(os.path.join(out_dir, "prodigal_out.gff"))
+ remove_file(os.path.join(out_dir, "prodigal_out_aas_tmp.fasta"))
+ elif gene_predictor == "prodigal-gv":
+ remove_file(os.path.join(out_dir, "prodigal-gv_out.gff"))
+ remove_file(os.path.join(out_dir, "prodigal-gv_out_aas_tmp.fasta"))
# delete the tmp meta files
if meta == True:
remove_directory(os.path.join(out_dir, "input_split_tmp/"))
@@ -2150,32 +2321,23 @@ def process_vfdb_results(out_dir, merged_df):
touch_file(vfdb_file)
vfdb_df = pd.read_csv(vfdb_file, delimiter="\t", index_col=False, names=col_list)
- genes = vfdb_df.gene.unique()
- # get top hit
- tophits = []
- for gene in genes:
- tmp_df = (
- vfdb_df.loc[vfdb_df["gene"] == gene]
- .sort_values("vfdb_eVal")
- .reset_index(drop=True)
- .loc[0]
- )
- tophits.append(
- [
- tmp_df.vfdb_hit,
- tmp_df.gene,
- tmp_df.vfdb_alnScore,
- tmp_df.vfdb_seqIdentity,
- tmp_df.vfdb_eVal,
- ]
- )
+ # optimise the tophits generation
+ # Group by 'gene' and find the top hit for each group
+ tophits_df = vfdb_df.groupby('gene', group_keys=True).apply(lambda group: group.nsmallest(1, 'vfdb_eVal')).reset_index(drop=True)
+
+
+
+ # create tophits df
+ tophits_df = tophits_df[[
+ "vfdb_hit",
+ "gene",
+ "vfdb_alnScore",
+ "vfdb_seqIdentity",
+ "vfdb_eVal",
+ ]]
- tophits_df = pd.DataFrame(
- tophits,
- columns=["vfdb_hit", "gene", "vfdb_alnScore", "vfdb_seqIdentity", "vfdb_eVal"],
- )
# left join vfdb to merged_df
tophits_df["gene"] = tophits_df["gene"].astype(str)
@@ -2254,32 +2416,21 @@ def process_card_results(out_dir, merged_df, db_dir):
]
touch_file(card_file)
card_df = pd.read_csv(card_file, delimiter="\t", index_col=False, names=col_list)
- genes = card_df.gene.unique()
- # get top hit
- tophits = []
+ #
+ tophits_df = card_df.groupby('gene', group_keys=True).apply(lambda group: group.nsmallest(1, 'CARD_eVal')).reset_index(drop=True)
- for gene in genes:
- tmp_df = (
- card_df.loc[card_df["gene"] == gene]
- .sort_values("CARD_eVal")
- .reset_index(drop=True)
- .loc[0]
- )
- tophits.append(
- [
- tmp_df.CARD_hit,
- tmp_df.gene,
- tmp_df.CARD_alnScore,
- tmp_df.CARD_seqIdentity,
- tmp_df.CARD_eVal,
- ]
- )
+
- tophits_df = pd.DataFrame(
- tophits,
- columns=["CARD_hit", "gene", "CARD_alnScore", "CARD_seqIdentity", "CARD_eVal"],
- )
+ # create tophits df
+ tophits_df = tophits_df[[
+ "CARD_hit",
+ "gene",
+ "CARD_alnScore",
+ "CARD_seqIdentity",
+ "CARD_eVal",
+ ]]
+
# left join tophits_df to merged_df
tophits_df["gene"] = tophits_df["gene"].astype(str)
diff --git a/bin/processes.py b/bin/processes.py
index 96bf2ef..8aa1f11 100644
--- a/bin/processes.py
+++ b/bin/processes.py
@@ -1,9 +1,11 @@
+import multiprocessing.pool
import os
import subprocess as sp
from datetime import datetime
import pandas as pd
import pyrodigal
+import pyrodigal_gv
from BCBio import GFF
from Bio import SeqIO
from Bio.Seq import Seq
@@ -12,6 +14,41 @@
from loguru import logger
from util import remove_directory
+
+def run_pyrodigal_gv(filepath_in, out_dir, threads):
+ """
+ Gets CDS using pyrodigal_gv
+ :param filepath_in: input filepath
+ :param out_dir: output directory
+ :param logger logger
+ :param meta Boolean - metagenomic mode flag
+ :param coding_table coding table for prodigal (default 11)
+ :return:
+ """
+
+ # true
+ orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True)
+
+ def _find_genes(record):
+ genes = orf_finder.find_genes(str(record.seq))
+ return (record.id, genes)
+
+ with multiprocessing.pool.ThreadPool(threads) as pool:
+ with open(os.path.join(out_dir, "prodigal-gv_out.gff"), "w") as gff:
+ with open(os.path.join(out_dir, "prodigal-gv_out_tmp.fasta"), "w") as dst:
+ with open(
+ os.path.join(out_dir, "prodigal-gv_out_aas_tmp.fasta"), "w"
+ ) as aa_fasta:
+ records = SeqIO.parse(filepath_in, "fasta")
+ for record_id, genes in pool.imap(_find_genes, records):
+ genes.write_gff(
+ gff, sequence_id=record_id, include_translation_table=True
+ )
+ genes.write_genes(dst, sequence_id=record_id)
+ # need to write the translation
+ genes.write_translations(aa_fasta, sequence_id=record_id)
+
+
##### phanotate meta mode ########
@@ -254,7 +291,7 @@ def run_phanotate(filepath_in, out_dir, logdir):
logger.error("Error with Phanotate\n")
-def run_pyrodigal(filepath_in, out_dir, meta, coding_table):
+def run_pyrodigal(filepath_in, out_dir, meta, coding_table, threads):
"""
Gets CDS using pyrodigal
:param filepath_in: input filepath
@@ -262,6 +299,7 @@ def run_pyrodigal(filepath_in, out_dir, meta, coding_table):
:param logger logger
:param meta Boolean - metagenomic mode flag
:param coding_table coding table for prodigal (default 11)
+ :param threads: threads
:return:
"""
@@ -270,25 +308,52 @@ def run_pyrodigal(filepath_in, out_dir, meta, coding_table):
prodigal_metamode = True
logger.info("Prodigal Meta Mode Enabled")
- # for training if you want different coding table
- seqs = [bytes(record.seq) for record in SeqIO.parse(filepath_in, "fasta")]
- record = SeqIO.parse(filepath_in, "fasta")
- orf_finder = pyrodigal.OrfFinder(meta=prodigal_metamode)
-
- # coding table possible if false
- if prodigal_metamode == False:
- trainings_info = orf_finder.train(*seqs, translation_table=int(coding_table))
- orf_finder = pyrodigal.OrfFinder(trainings_info, meta=prodigal_metamode)
-
- with open(os.path.join(out_dir, "prodigal_out.gff"), "w") as dst:
- for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
- genes = orf_finder.find_genes(str(record.seq))
- genes.write_gff(dst, sequence_id=record.id)
-
- with open(os.path.join(out_dir, "prodigal_out_tmp.fasta"), "w") as dst:
- for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
- genes = orf_finder.find_genes(str(record.seq))
- genes.write_genes(dst, sequence_id=record.id)
+ #######################################################
+ # if under 20000, pyrodigal will only work in meta mode
+ # https://github.com/hyattpd/prodigal/wiki/Advice-by-Input-Type#plasmids-phages-viruses-and-other-short-sequences
+ # https://github.com/hyattpd/Prodigal/issues/51
+ # so make sure of this
+ #######################################################
+
+ # get total length of input
+ total_length = 0
+
+ with open(filepath_in, "r") as handle:
+ for record in SeqIO.parse(handle, "fasta"):
+ total_length += len(record.seq)
+
+ # if the length is 100000 or under, use meta mode by default
+ if total_length < 100001:
+ orf_finder = pyrodigal.GeneFinder(meta=True)
+ # otherwise train it
+ # recommend pyrodigal-gv anyway
+ else:
+ # for training if you want different coding table
+ seqs = [bytes(record.seq) for record in SeqIO.parse(filepath_in, "fasta")]
+ record = SeqIO.parse(filepath_in, "fasta")
+ orf_finder = pyrodigal.GeneFinder(meta=prodigal_metamode)
+
+ # make coding table possible if false
+ if prodigal_metamode == False:
+ orf_finder.train(*seqs, translation_table=int(coding_table))
+
+ # define for the multithreadpool
+ def _find_genes(record):
+ genes = orf_finder.find_genes(str(record.seq))
+ return (record.id, genes)
+
+ with multiprocessing.pool.ThreadPool(threads) as pool:
+ with open(os.path.join(out_dir, "prodigal_out.gff"), "w") as gff:
+ with open(os.path.join(out_dir, "prodigal_out_tmp.fasta"), "w") as dst:
+ with open(
+ os.path.join(out_dir, "prodigal_out_aas_tmp.fasta"), "w"
+ ) as aa_fasta:
+ records = SeqIO.parse(filepath_in, "fasta")
+ for record_id, genes in pool.imap(_find_genes, records):
+ genes.write_gff(gff, sequence_id=record_id)
+ genes.write_genes(dst, sequence_id=record_id)
+ # need to write the translation
+ genes.write_translations(aa_fasta, sequence_id=record_id)
def tidy_phanotate_output(out_dir):
@@ -319,13 +384,19 @@ def tidy_phanotate_output(out_dir):
return phan_df
-def tidy_prodigal_output(out_dir):
+def tidy_prodigal_output(out_dir, gv_flag):
"""
Tidies prodigal output
:param out_dir: output directory
+ :param gv_flag: if prodigal-gv, then True
:return: prod_filt_df pandas dataframe
"""
- prod_file = os.path.join(out_dir, "prodigal_out.gff")
+ if gv_flag is True:
+ prefix = "prodigal-gv"
+ else:
+ prefix = "prodigal"
+
+ prod_file = os.path.join(out_dir, f"{prefix}_out.gff")
col_list = [
"contig",
"prod",
@@ -367,7 +438,7 @@ def tidy_prodigal_output(out_dir):
+ prod_filt_df["stop"].astype(str)
)
prod_filt_df.to_csv(
- os.path.join(out_dir, "cleaned_prodigal.tsv"), sep="\t", index=False
+ os.path.join(out_dir, f"cleaned_{prefix}.tsv"), sep="\t", index=False
)
return prod_filt_df
@@ -479,14 +550,17 @@ def translate_fastas(out_dir, gene_predictor, coding_table, genbank_file):
if gene_predictor == "phanotate":
clean_df = tidy_phanotate_output(out_dir)
elif gene_predictor == "prodigal":
- clean_df = tidy_prodigal_output(out_dir)
+ clean_df = tidy_prodigal_output(out_dir, False) # gv_flag is false
+ elif gene_predictor == "prodigal-gv":
+ clean_df = tidy_prodigal_output(out_dir, True) # gv_flag is true
elif gene_predictor == "genbank":
clean_df = tidy_genbank_output(out_dir, genbank_file, coding_table)
- fasta_input_tmp = gene_predictor + "_out_tmp.fasta"
fasta_output_aas_tmp = gene_predictor + "_aas_tmp.fasta"
- if gene_predictor != "genbank":
+ if gene_predictor == "phanotate":
+ # read the nucl fasta
+ fasta_input_tmp = gene_predictor + "_out_tmp.fasta"
# translate for temporary AA output
with open(os.path.join(out_dir, fasta_output_aas_tmp), "w") as aa_fa:
i = 0
@@ -504,6 +578,26 @@ def translate_fastas(out_dir, gene_predictor, coding_table, genbank_file):
)
SeqIO.write(aa_record, aa_fa, "fasta")
i += 1
+ elif gene_predictor == "prodigal-gv" or gene_predictor == "prodigal":
+ # read in the AA file instead and parse that to clean the header
+ fasta_input_tmp = gene_predictor + "_out_aas_tmp.fasta"
+ with open(os.path.join(out_dir, fasta_output_aas_tmp), "w") as aa_fa:
+ i = 0
+ for dna_record in SeqIO.parse(
+ os.path.join(out_dir, fasta_input_tmp), "fasta"
+ ):
+ dna_header = str(clean_df["contig"].iloc[i]) + str(i)
+ dna_description = (
+ str(clean_df["start"].iloc[i]) + "_" + str(clean_df["stop"].iloc[i])
+ )
+ aa_record = SeqRecord(
+ dna_record.seq,
+ id=dna_header,
+ description=dna_description,
+ )
+ SeqIO.write(aa_record, aa_fa, "fasta")
+ i += 1
+ # for genbank do nothing
def run_trna_scan(filepath_in, threads, out_dir, logdir):
@@ -529,7 +623,7 @@ def run_trna_scan(filepath_in, threads, out_dir, logdir):
try:
ExternalTool.run_tool(trna)
except:
- logger.error("Error: tRNAscan-SE not found\n")
+ logger.error("Error with tRNAscan-SE")
return 0
@@ -631,53 +725,51 @@ def run_mmseqs(db_dir, out_dir, threads, logdir, gene_predictor, evalue, db_name
remove_directory(target_db_dir)
-def convert_gff_to_gbk(filepath_in, input_dir, out_dir, prefix, coding_table):
+def convert_gff_to_gbk(filepath_in, input_dir, out_dir, prefix, prot_seq_df):
"""
Converts the gff to genbank
:param filepath_in: input fasta file
- :param input_dir: input directory of the gff. same as output_dir for the overall gff, diff for meta mode
+ :param input_dir: input directory of the gff. same as output_dir for the overall gff in normal mode, differeny for meta mode
:param out_dir: output directory of the gbk
:param prefix: prefix
+ :param prefix: prot_seq_df from pharok object with gene name + protein sequence for all genes (from create_gff()).
:return:
"""
- gff_file = os.path.join(input_dir, prefix + ".gff")
- gbk_file = os.path.join(out_dir, prefix + ".gbk")
+
+ gff_file = os.path.join(input_dir, f"{prefix}.gff")
+ gbk_file = os.path.join(out_dir, f"{prefix}.gbk")
+
with open(gbk_file, "wt") as gbk_handler:
fasta_handler = SeqIO.to_dict(SeqIO.parse(filepath_in, "fasta"))
for record in GFF.parse(gff_file, fasta_handler):
+ # sequence in each contig (record)
+ subset_seqs_df = prot_seq_df.loc[prot_seq_df["contig"] == record.id]
+ # get all the seqs in the contigs - and drop the index to reset for 0 indexed loop
+ subset_seqs = subset_seqs_df["sequence"].reset_index(drop=True)
+ # start the loop
+ i = 0
+
# instantiate record
record.annotations["molecule_type"] = "DNA"
record.annotations["date"] = datetime.today()
record.annotations["topology"] = "linear"
- record.annotations["data_file_division"] = "VRL"
+ record.annotations[
+ "data_file_division"
+ ] = "PHG" # https://github.com/RyanCook94/inphared/issues/22
# add features to the record
for feature in record.features:
# add translation only if CDS
if feature.type == "CDS":
+ # aa = prot_records[i].seq
if feature.strand == 1:
feature.qualifiers.update(
- {
- "translation": Seq.translate(
- record.seq[
- feature.location.start.position : feature.location.end.position
- ],
- to_stop=True,
- table=coding_table,
- )
- }
+ {"translation": subset_seqs[i]} # from the aa seq
)
else: # reverse strand -1 needs reverse compliment
feature.qualifiers.update(
- {
- "translation": Seq.translate(
- record.seq[
- feature.location.start.position : feature.location.end.position
- ].reverse_complement(),
- to_stop=True,
- table=coding_table,
- )
- }
+ {"translation": subset_seqs[i]} # from the aa seq
)
+ i += 1
SeqIO.write(record, gbk_handler, "genbank")
diff --git a/bin/run_pyrodigal_gv.py b/bin/run_pyrodigal_gv.py
new file mode 100644
index 0000000..96b2795
--- /dev/null
+++ b/bin/run_pyrodigal_gv.py
@@ -0,0 +1,35 @@
+import os
+
+import pyrodigal_gv
+from Bio import SeqIO
+
+# from Bio.Seq import Seq
+# from Bio.SeqRecord import SeqRecord
+# from external_tools import ExternalTool
+# from loguru import logger
+# from util import remove_directory
+
+
+def run_pyrodiga_gv(filepath_in, out_dir, coding_table):
+ """
+ Gets CDS using pyrodigal_gv
+ :param filepath_in: input filepath
+ :param out_dir: output directory
+ :param logger logger
+ :param meta Boolean - metagenomic mode flag
+ :param coding_table coding table for prodigal (default 11)
+ :return:
+ """
+
+ # true
+ orf_finder = pyrodigal_gv.ViralGeneFinder(meta=True)
+
+ with open(os.path.join(out_dir, "prodigal_out.gff"), "w") as dst:
+ for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
+ genes = orf_finder.find_genes(str(record.seq))
+ genes.write_gff(dst, sequence_id=record.id)
+
+ with open(os.path.join(out_dir, "prodigal_out_tmp.fasta"), "w") as dst:
+ for i, record in enumerate(SeqIO.parse(filepath_in, "fasta")):
+ genes = orf_finder.find_genes(str(record.seq))
+ genes.write_genes(dst, sequence_id=record.id)
diff --git a/bin/version.py b/bin/version.py
index bf25615..5b60188 100644
--- a/bin/version.py
+++ b/bin/version.py
@@ -1 +1 @@
-__version__ = "1.4.1"
+__version__ = "1.5.0"
diff --git a/build/build.sh b/build/build.sh
deleted file mode 100644
index 36241a3..0000000
--- a/build/build.sh
+++ /dev/null
@@ -1,8 +0,0 @@
-#!/bin/sh
-set -e
-
-mkdir -p "${PREFIX}/bin"
-mkdir -p "${PREFIX}/bin/modules"
-
-cp -r bin/* "${PREFIX}/bin/"
-cp -r bin/modules/* "${PREFIX}/bin/modules"
diff --git a/build/meta.yaml b/build/meta.yaml
deleted file mode 100644
index 983ef57..0000000
--- a/build/meta.yaml
+++ /dev/null
@@ -1,49 +0,0 @@
-{% set version = "1.3.0" %}
-{% set name = "pharokka" %}
-{% set sha256 = "17055ecc532dbc18908ba2d6da5675ce7c08b3ab2767d4def7de9935281bc4df" %}
-{% set user = "gbouras13" %}
-
-package:
- name: {{ name }}
- version: {{ version }}
-
-build:
- number: 0
- noarch: python
-
-source:
- url: https://github.com/{{ user }}/{{ name }}/archive/refs/tags/v{{ version }}.tar.gz
- sha256: {{ sha256 }}
-
-requirements:
- run:
- - bcbio-gff
- - biopython >=1.78,<1.81
- - phanotate >=1.5.0
- - mmseqs2 ==13.45111
- - trnascan-se >=2.0.9
- - minced >=0.4.2
- - aragorn >=1.2.41
- - mash >=2.2
- - pyrodigal >=2.0.1
- - pycirclize >=0.3.1
- - dnaapler >= 0.3.0
-
-test:
- commands:
- - install_databases.py -h
- - pharokka.py -h
- - pharokka_plotter.py -h
- - pharokka_proteins.py -h
-
-about:
- home: https://github.com/gbouras13/pharokka
- license: MIT
- license_file: LICENSE
- summary: "Fast Phage Annotation Program"
- dev_url: https://github.com/gbouras13/pharokka
- doc_url: https://pharokka.readthedocs.io
-
-extra:
- recipe-maintainers:
- - gbouras13
diff --git a/docs/benchmarking.md b/docs/benchmarking.md
index 4d04e21..55ca0e4 100644
--- a/docs/benchmarking.md
+++ b/docs/benchmarking.md
@@ -67,4 +67,20 @@ The 673 crAss-like genomes were run with `-m` (defaults to `--mmseqs2_only` in v
| Annotated Function CDS | **16713** | 9150 | 9150 |
| Unknown Function CDS | 75286 | 82849 | 82849 |
-
\ No newline at end of file
+
+# Benchmarking v1.5.0
+
+`pharokka v1.5.0` was run on the 673 crAss phage dataset to showcase the improved CDS prediction of `-g prodigal-gv` for metagenomic datasets where some phages likely have alternative genetic codes.
+
+All benchmarking was conducted on a Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with 8 threads (`-t 8`). `pyrodigal-gv v0.1.0` and `pyrodigal v3.0.0` were used respectively with `--fast`.
+
+| 673 crAss-like genomes | `pharokka` v1.5.0 `-g prodigal-gv` | `pharokka` v1.5.0 `-g prodigal` |
+|------------------------|------------------------------------|----------------------------------|
+| Total CDS | 81730 | 91999 |
+| Annotated Function CDS | **20344** | 17458 |
+| Unknown Function CDS | 61386 | 74541 |
+| Contigs with genetic code 15 | 229 | NA |
+| Contigs with genetic code 4 | 38 | NA |
+| Contigs with genetic code 11 | 406 | 673 |
+
+Fewer larger CDS were predicted more accurately, leading to an increase in the number of coding sequences with annotated functions. Approximately 40% of contigs in this dataset were predicted to use non-standard genetic codes according to `pyrodigal-gv`.
\ No newline at end of file
diff --git a/docs/citation.md b/docs/citation.md
index df6d077..0ce2144 100644
--- a/docs/citation.md
+++ b/docs/citation.md
@@ -7,7 +7,7 @@ If you use pharokka, I would recommend a citation in your manuscript along the l
* All phages were annotated with Pharokka v ___ (Bouras, et al. 2023). Specifically, coding sequences (CDS) were predicted with PHANOTATE (McNair, et al. 2019), tRNAs were predicted with tRNAscan-SE 2.0 (Chan, et al. 2021), tmRNAs were predicted with Aragorn (Laslett, et al. 2004) and CRISPRs were preducted with CRT (Bland, et al. 2007). Functional annotation was generated by matching each CDS to the PHROGs (Terzian, et al. 2021), VFDB (Chen, et al. 2005) and CARD (Alcock, et al. 2020) databases using MMseqs2 (Steinegger, et al. 2017) and PyHMMER (Larralde and Zeller 2023). Contigs were matched to their closest hit in the INPHARED database (Cook, et al. 2021) using mash (Ondov, et al. 2016). Plots were created with pyCirclize (Shimoyama 2022).
-With the following full citations for the constituent tools below:
+With the following full citations for the constituent tools below where relevant:
* Cook R, Brown N, Redgwell T, Rihtman B, Barnes M, Clokie M, Stekel DJ, Hobman JL, Jones MA, Millard A. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. 2021. Available from: http://doi.org/10.1089/phage.2021.0007.
* McNair K., Zhou C., Dinsdale E.A., Souza B., Edwards R.A. (2019) "PHANOTATE: a novel approach to gene identification in phage genomes", Bioinformatics, https://doi.org/10.1093/bioinformatics/btz26.
@@ -21,4 +21,5 @@ With the following full citations for the constituent tools below:
* Alcock et al, "CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database." Nucleic Acids Research (2020) https://doi.org/10.1093/nar/gkz935.
* Larralde, M., (2022). Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. Journal of Open Source Software, 7(72), 4296. doi:10.21105/joss.04296.
* Larralde M., Zeller G., (2023). PyHMMER: a Python library binding to HMMER for efficient sequence analysis, Bioinformatics, Volume 39, Issue 5, May 2023, btad214, https://doi.org/10.1093/bioinformatics/btad214.
+* Larradle M. and Camargo A., (2023) Pyrodigal-gv: A Pyrodigal extension to predict genes in giant viruses and viruses with alternative genetic code. https://github.com/althonos/pyrodigal-gv.
* Shimoyama, Y. (2022). pyCirclize: Circular visualization in Python [Computer software]. https://github.com/moshi4/pyCirclize
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 46ee94b..f1624f3 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,17 +1,17 @@
+# `pharokka`
+
`pharokka` is a fast phage annotation pipeline.
![Image](pharokka_logo.png)
-`pharokka` uses PHANOTATE (McNair et al 2019) to conduct gene prediction, tRNAscan-SE 2 (Chan et al 2021) to call tRNAs, MinCED (Bland et al 2007) to detect CRISPRs and Aragorn (Laslett et al 2004) to detect tmRNAs. There is also the option to specify Prodigal (Hyatt et al 2010) implemented with Pyrodigal (Larralde, 2022) instead of PHANOTATE.
-
-`pharokka` then uses the lightweight PHROGS database (Terzian et al 2021) for functional annotation of all predicted CDSs using MMseqs2 (Steinegger et al 2017), and as of v1.4.0, PyHMMER (Larralde and Zeller 2023) for more sensitive annotations. `pharokka` also matches each predicted CDS against the VFDB (Chen et al 2005) and CARD (Alcock et al 2020) databases to predict virulence factors and antimicrobial resistance, respectively.
-
-For more information, please read the `pharokka` manuscript:
+## Overview
-George Bouras, Roshan Nepal, Ghais Houtak, Alkis James Psaltis, Peter-John Wormald, Sarah Vreugde, Pharokka: a fast scalable bacteriophage annotation tool, Bioinformatics, Volume 39, Issue 1, January 2023, btac776, [https://doi.org/10.1093/bioinformatics/btac776](https://doi.org/10.1093/bioinformatics/btac776)
+`pharokka` uses [PHANOTATE](https://github.com/deprekate/PHANOTATE), the only gene prediction program tailored to bacteriophages, as the default program for gene prediction. [Prodigal](https://github.com/hyattpd/Prodigal) implemented with [pyrodigal](https://github.com/althonos/pyrodigal) and [Prodigal-gv](https://github.com/apcamargo/prodigal-gv) implemented with [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) are also available as alternatives. Following this, functional annotations are assigned by matching each predicted coding sequence (CDS) to the [PHROGs](https://phrogs.lmge.uca.fr), [CARD](https://card.mcmaster.ca) and [VFDB](http://www.mgc.ac.cn/VFs/main.htm) databases using [MMseqs2](https://github.com/soedinglab/MMseqs2). As of v1.4.0, `pharokka` will also match each CDS to the PHROGs database using more sensitive Hidden Markov Models using [PyHMMER](https://github.com/althonos/pyhmmer). Pharokka's main output is a GFF file suitable for using in downstream pangenomic pipelines like [Roary](https://sanger-pathogens.github.io/Roary/). `pharokka` also generates a `cds_functions.tsv` file, which includes counts of CDSs, tRNAs, tmRNAs, CRISPRs and functions assigned to CDSs according to the PHROGs database. See the full [usage](#usage) and check out the full [documentation](https://pharokka.readthedocs.io) for more details.
![Image](pharokka_workflow.png)
+## Manuscript
+For more information, please read the `pharokka` manuscript:
-
+George Bouras, Roshan Nepal, Ghais Houtak, Alkis James Psaltis, Peter-John Wormald, Sarah Vreugde, Pharokka: a fast scalable bacteriophage annotation tool, Bioinformatics, Volume 39, Issue 1, January 2023, btac776, [https://doi.org/10.1093/bioinformatics/btac776](https://doi.org/10.1093/bioinformatics/btac776)
diff --git a/docs/install.md b/docs/install.md
index 0857060..112fbb3 100644
--- a/docs/install.md
+++ b/docs/install.md
@@ -53,7 +53,7 @@ pharokka.py --help
# Database Installation
-* **Note v 1.4.0 implements a new database with PHROGs HMM profiles. You will need to update the Pharokka database to use v1.4.0**
+* **Note v 1.4.0 implements a new database with PHROGs HMM profiles. You will need to update the Pharokka database to use v1.4.0 and higher**
To install the pharokka database to the default directory:
diff --git a/docs/output.md b/docs/output.md
index dee51e4..a3d6bc7 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -18,7 +18,7 @@ Other Files
* A `_cds_functions.tsv` file, which includes counts of CDSs, tRNAs, CRISPRs and tmRNAs and functions assigned to CDSs according to the PHROGs database.
-* A `_length_gc_cds_density.tsv` file, which outputs the phage's length, GC percentage and CDS coding density.
+* A `_length_gc_cds_density.tsv` file, which outputs the phage's length, GC percentage, translation table and CDS coding density.
* `phanotate.ffn` or `prodigal.ffn` which will hold all nucleotide sequences of predicted CDSs.
diff --git a/docs/pharokka_workflow.png b/docs/pharokka_workflow.png
index 3caf568..6b86751 100644
Binary files a/docs/pharokka_workflow.png and b/docs/pharokka_workflow.png differ
diff --git a/docs/run.md b/docs/run.md
index 86bb718..868a8b5 100644
--- a/docs/run.md
+++ b/docs/run.md
@@ -50,12 +50,32 @@ To use Prodigal (pyrodigal) gene predictions instead of PHANOTATE use `-g prodig
pharokka.py -i -o