This repository includes a Snakemake workflow used for bioinformatic analysis for the paper "Rhodopsin-bestrophin fusion proteins from unicellular algae form gigantic pentameric ion channels". Please refer to the paper for further details. Additional data are available from the paper and from this repository.
The entry point is `workflow/Snakefile`, so the workflow can be launched by simply calling `snakemake -c[number_of_cores]` in the root directory. Most of the dependencies are handled by conda (see `workflow/envs` for details): use snakemake's `--use-conda` flag. However, a number of third-party programs have to be installed manually. The versions used and the executables expected to be in the `$PATH` are:

- Geneconv v1.81a - `geneconv`
- Root Digger v1.7.0-7-gccbe87e - `rd`
- busco v5.beta.1 (for other versions conda can be used) - `busco`
- TreeShrink v1.3.6 - `run_treeshrink.py`
- ASTRAL v5.7.4 - specify the path to `astral.version.jar` in `config/config.yaml` (also note that `java` is not installed via conda)
- ERaBLE v1.0 - `erable`
- SequenceBouncer v1.18 - `SequenceBouncer.py`
- SignalP v4.1 - `signalp` (later major versions are not supported)
- TargetP v2.0 - `targetp`
- ASAFind v1.1.7 - `ASAFind.py`
- InterProScan v5.48-83.0 (available from conda, but we keep a global installation) - `interproscan.sh`

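
Since the programs listed above are not installed by conda, it can help to check that they resolve on the `$PATH` before launching the workflow. The following is a minimal sketch, not part of the workflow itself: the function name `missing_executables` and the list `REQUIRED_EXECUTABLES` are hypothetical helpers, and the executable names are copied from the list above. The ASTRAL jar is configured in `config/config.yaml`, so only `java` is checked for it.

```python
import shutil

# Executable names copied from the list above. The ASTRAL jar is set in
# config/config.yaml rather than looked up on $PATH, so only `java` is checked.
REQUIRED_EXECUTABLES = [
    "geneconv",
    "rd",
    "busco",
    "run_treeshrink.py",
    "java",
    "erable",
    "SequenceBouncer.py",
    "signalp",
    "targetp",
    "ASAFind.py",
    "interproscan.sh",
]


def missing_executables(names, which=shutil.which):
    """Return the names that cannot be resolved on $PATH.

    `which` is injectable so the function can be tested without
    the real programs being installed.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(REQUIRED_EXECUTABLES)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All manually installed executables were found.")
```

This only checks that the executables exist, not that their versions match the ones listed. Once nothing is reported missing, the workflow can be started from the root directory with, for example, `snakemake -c4 --use-conda`.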
The workflow consists of four independent components:
- Codon-based analyses:
  - codon and protein phylogenies of the domains
  - different rooting strategies
  - recombination analyses
- Global phylogenies for:
  - rhodopsins and
  - bestrophins
- Species phylogenies for:
  - chlorophytes,
  - dinoflagellates and
  - haptophytes
- Structural alignment for:
  - rhodopsins and
  - bestrophins

The input files are provided in the `input/` folder. In particular:

- inputs for the codon-based analysis are codon sequences for the bestrhodopsin domains, provided as fasta files in `input/codons/domains/` (e.g. `input/codons/domains/bestrophins.fasta`). The fasta files also include outgroup sequences trimmed to homologous regions.
- inputs for the species phylogenies are expected in the `input/species/` directory, but the complete fasta files with all protein sequences for the assemblies are not included in the repository. Instead, the pipeline can be started from the extracted orthogroups, which can be downloaded from the data repository.
- inputs for the global bestrophin phylogeny were downloaded from UniRef and Pfam and are provided as `input/global_phylogeny/bestrophins/uniprot_aln.fasta` (Pfam bestrophin alignment), `input/global_phylogeny/bestrophins/uniref50.fasta` (UniRef50 sequences of bestrophins) and `input/global_phylogeny/bestrophins/uniref50.txt` (list of bestrophins from UniRef50 in tabular format).
- inputs for the global rhodopsin phylogeny are also available in `input/global_phylogeny/rhodopsins`. The file `rhodopsin_selected_8TMs_from_uniref50.fasta` is derived from UniRef50 sequences matched to the microbial rhodopsin Pfam profile.

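
Because some inputs have to be downloaded separately, a quick look at what is actually present in a checkout can save a failed run. The sketch below is an illustration only and is not part of the workflow: `check_inputs`, `count_fasta_records`, `EXPECTED_FILES` and `EXPECTED_DIRS` are hypothetical names, and the paths are limited to those spelled out in full above.

```python
from pathlib import Path

# Paths taken from the description above; only fully specified ones are listed.
EXPECTED_FILES = [
    "input/codons/domains/bestrophins.fasta",
    "input/global_phylogeny/bestrophins/uniprot_aln.fasta",
    "input/global_phylogeny/bestrophins/uniref50.fasta",
    "input/global_phylogeny/bestrophins/uniref50.txt",
]
EXPECTED_DIRS = [
    "input/species",
    "input/global_phylogeny/rhodopsins",
]


def count_fasta_records(path):
    """Count fasta records by counting header lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_inputs(root="."):
    """Return a dict mapping each expected path to a short status string."""
    root = Path(root)
    report = {}
    for rel in EXPECTED_FILES:
        path = root / rel
        if not path.is_file():
            report[rel] = "missing"
        elif path.suffix == ".fasta":
            report[rel] = f"{count_fasta_records(path)} sequences"
        else:
            report[rel] = "present"
    for rel in EXPECTED_DIRS:
        report[rel] = "present" if (root / rel).is_dir() else "missing"
    return report


if __name__ == "__main__":
    for rel, status in check_inputs(".").items():
        print(f"{rel}: {status}")
```

Run from the repository root, this prints one line per expected path. A `missing` entry for `input/species` is expected until the species inputs or the extracted orthogroups have been put in place.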
The output files are collected in the `output/` folder:

- `output/codons/rooted_trees.svg` includes rooted domain trees:
  - of the rhodopsin domains: the tree based on the codon alignment and the same topology with branch lengths optimized based on the protein alignment;
  - of the bestrophin domains: the protein tree including two outgroup sequences (A0A6T5JQZ6 and A0A6T5CHT1), the codon tree with TGD-B, the codon tree without TGD-B and the same tree topology with branch lengths optimized based on the protein alignment. The last three trees are rooted with Root Digger, with the numbers next to the root indicating the Likelihood Weight Ratio of the root placement.
- `output/codons/rhodopsin_recombination.svg` covers recombination analyses, from top to bottom:
  - GENECONV analysis
  - pairwise nucleotide identity profiles for selected sequences averaged over sliding windows of 15 bp
  - GARD analysis
- `output/global_phylogeny/` includes global phylogeny trees (newick and annotated pdf)
- `output/species/*.svg` are visualizations of the species trees
- `output/species/species_chronos.nwk` contains chronos-scaled species trees in newick format
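
To make the sliding-window identity profile mentioned for `output/codons/rhodopsin_recombination.svg` more concrete, here is a minimal sketch of how such a profile could be computed for one pair of aligned nucleotide sequences. It is not the workflow's own code: the function name `windowed_identity` is hypothetical, and counting gap positions as mismatches is an assumption made for this illustration.

```python
def windowed_identity(seq_a, seq_b, window=15):
    """Identity fraction per sliding window for two aligned sequences.

    Assumption (not from the workflow): a position counts as identical
    only if both sequences carry the same non-gap character.
    Returns one value per window start; empty if the alignment is
    shorter than the window.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]
    if len(matches) < window:
        return []
    # Running sum: add the position entering the window, drop the one leaving.
    current = sum(matches[:window])
    profile = [current / window]
    for i in range(window, len(matches)):
        current += matches[i] - matches[i - window]
        profile.append(current / window)
    return profile


if __name__ == "__main__":
    a = "ATGGCTAGCTAGGATCCGTA"
    b = "ATGGCTAGCTAGGATCAAAA"
    for start, value in enumerate(windowed_identity(a, b)):
        print(start, round(value, 3))
```

Each value is the fraction of matching positions in a 15 bp window, so a drop in the profile marks a stretch where the two sequences diverge, which is the kind of signal a recombination plot is meant to show.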