The bestrhodopsin analysis pipeline

This repository contains the Snakemake workflow used for the bioinformatic analyses in the paper "Rhodopsin-bestrophin fusion proteins from unicellular algae form gigantic pentameric ion channels". Please refer to the paper for further details. Additional data are available from the paper and from this repository.

The entry point is workflow/Snakefile, so the workflow can be launched by calling snakemake -c[number_of_cores] in the root directory. Most of the dependencies are handled by conda (see workflow/envs for details); pass --use-conda to snakemake to enable this. However, a number of third-party programs have to be installed manually. The versions used and the executables expected to be in the $PATH are:

Geneconv v1.81a - geneconv
Root Digger v1.7.0-7-gccbe87e - rd
busco v5.beta.1 (for other versions conda can be used) - busco
TreeShrink v1.3.6 - run_treeshrink.py
ASTRAL v5.7.4 - specify the path to astral.version.jar in config/config.yaml (note also that java is not installed via conda)
ERaBLE v1.0 - erable
SequenceBouncer v1.18 - SequenceBouncer.py
SignalP v4.1 - signalp (versions 5 and later are not supported)
TargetP v2.0 - targetp
ASAFind v1.1.7 - ASAFind.py
InterProScan v5.48-83.0 (available from conda, but we keep a global installation) - interproscan.sh
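
Before launching the workflow, it can help to confirm that the manually installed executables listed above are visible on the $PATH. The following is a minimal sketch and not part of the repository: the helper name find_missing is hypothetical, and the executable names are copied from the list above. ASTRAL is left out because it is configured by path in config/config.yaml, while java is included because it is not installed via conda.

```python
import shutil

# Executable names copied from the list of manually installed programs above.
REQUIRED_EXECUTABLES = [
    "geneconv", "rd", "busco", "run_treeshrink.py", "erable",
    "SequenceBouncer.py", "signalp", "targetp", "ASAFind.py",
    "interproscan.sh", "java",
]


def find_missing(executables):
    """Return the executables that cannot be found on the current $PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_EXECUTABLES)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All third-party executables were found.")
```

If nothing is reported as missing, the workflow can be started from the root directory with, for example, snakemake -c4 --use-conda (the core count of four is arbitrary).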

The workflow

The workflow consists of four independent components:

Codon-based analyses:

codon and protein phylogenies of the domains
different rooting strategies
recombination analyses

Global phylogenies for:

rhodopsins and
bestrophins

Species phylogenies for:

chlorophytes,
dinoflagellates and
haptophytes

Structural alignment for:

rhodopsins and
bestrophins

Input files

The input files are provided in the input/ folder. In particular:

inputs for the codon-based analysis are codon sequences for bestrhodopsin domains: input/codons/domains/rhodopsins.fasta and input/codons/domains/bestrophins.fasta. The fasta files also include outgroup sequences trimmed to homologous regions.
inputs for the species phylogenies are expected in the input/species/ directory, but the complete fasta files with all protein sequences for the assemblies are not included in the repository. Instead, the pipeline can be started from the extracted orthogroups, which can be downloaded from the data repository.
inputs for the global bestrophin phylogeny were downloaded from UniRef and Pfam and are provided as input/global_phylogeny/bestrophins/uniprot_aln.fasta (Pfam bestrophin alignment), input/global_phylogeny/bestrophins/uniref50.fasta (UniRef50 sequences of bestrophins) and input/global_phylogeny/bestrophins/uniref50.txt (list of bestrophins from UniRef50 in tabular format).
inputs for the global rhodopsin phylogeny are available in input/global_phylogeny/rhodopsins. The file rhodopsin_selected_8TMs_from_uniref50.fasta is derived from UniRef50 sequences matched to the microbial rhodopsin Pfam profile.
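
Since the codon-based analyses start from codon sequences, a quick sanity check of a fasta input is to confirm that every sequence length is a multiple of three. The sketch below is not part of the workflow; the function names read_fasta and not_in_frame are hypothetical, and it assumes that gap characters ("-"), if present, should be ignored when counting nucleotides.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The full header line (without ">") is used as the record name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def not_in_frame(records):
    """Return headers of sequences whose ungapped length is not a multiple of 3."""
    return [
        header
        for header, seq in records.items()
        if len(seq.replace("-", "")) % 3 != 0
    ]
```

For example, the bestrophin domain file could be checked with:

```python
with open("input/codons/domains/bestrophins.fasta") as handle:
    print(not_in_frame(read_fasta(handle)))
```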

Output files

The output files are collected in the output/ folder:

output/codons/rooted_trees.svg includes rooted domain trees:

of the rhodopsin domains: a tree based on the codon alignment and the same topology with branch lengths optimized based on the protein alignment;
of the bestrophin domains: a protein tree including two outgroup sequences (A0A6T5JQZ6 and A0A6T5CHT1), a codon tree with TGD-B, a codon tree without TGD-B, and the same tree topology with branch lengths optimized based on the protein alignment. The last three trees are rooted with Root Digger; the numbers next to the root indicate the Likelihood Weight Ratio of the root placement.
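
To help read the numbers next to the roots, the sketch below shows how a likelihood weight ratio can be computed. It assumes the usual definition, in which the likelihood of one candidate root placement is divided by the sum of the likelihoods of all candidate placements; it is an illustration only and does not reproduce Root Digger's own implementation. The function name is hypothetical.

```python
import math


def likelihood_weight_ratios(log_likelihoods):
    """Convert per-placement log-likelihoods into weights that sum to 1."""
    best = max(log_likelihoods)
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]
```

Under this definition, a value close to 1 means that one root placement accounts for almost all of the likelihood, whereas similar values across several branches indicate an uncertain root.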

output/codons/rhodopsin_recombination.svg covers recombination analyses, from top to bottom:

GENECONV analysis
pairwise nucleotide identity profiles for selected sequences averaged over sliding windows of 15 bp
GARD analysis
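
The pairwise identity profile in the middle panel can be illustrated with a short sketch. It is not the code used by the workflow: the function name is hypothetical, and it assumes that the two sequences are already aligned, that the window advances by one position at a time, and that gap positions count as mismatches.

```python
def windowed_identity(seq_a, seq_b, window=15):
    """Average identity of two aligned sequences over sliding windows."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # 1 for an identical, non-gap position and 0 otherwise.
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]
    # One value per window start; sequences shorter than the window give [].
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]
```

Each value in the returned list is the fraction of identical positions in one 15 bp window, so a sharp change along the profile marks a region where the two sequences switch between high and low similarity.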

output/global_phylogeny/ includes global phylogeny trees (newick and annotated pdf)
output/species/*.svg are visualizations of the species trees
output/species/species_chronos.nwk contains the chronos-scaled species trees in newick format
