In brief, this pipeline is designed to:
- Predict proteins from transcriptomes (TransDecoder).
- Find orthogroups with OrthoFinder and the methods from Yang et al.
- Find patterns of positive selection with FastCodeML.
- Annotate transcripts with TransDecoder / Trinotate.
- Assess transcriptome completeness with BUSCO.
- Install conda.
- Install snakemake:
    conda install --yes snakemake
- Clone this repo. In case of errors with SSL certificates, add -c http.sslVerify=false to the command:
    git clone --recursive https://github.com/jlanga/smsk_orthofinder.git
- Compile the necessary dependencies (phyx, guidance and fastcodeml):
    bash src/compile_deps.sh
- Introduce the paths to your samples in samples.tsv.
- Run the pipeline as is:
    snakemake --use-conda --jobs
  or run it inside a Docker container:
    bash src/docker_run.sh -j 4
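Before a long run, it can help to let Snakemake list the planned jobs first. The sketch below wraps the run command from the last step in a hypothetical helper, run_pipeline, which is not part of this repository. It assumes an explicit job count (defaulting to 4, as in the Docker example) and uses Snakemake's --dry-run option to preview the work before executing it.

```shell
# Hypothetical helper (not shipped with this repo): preview, then run.
run_pipeline() {
    # Number of parallel jobs; 4 is an assumed default
    n_jobs="${1:-4}"
    # --dry-run lists the jobs Snakemake would execute without running them
    snakemake --use-conda --jobs "$n_jobs" --dry-run || return 1
    # Only reached when the dry run succeeded
    snakemake --use-conda --jobs "$n_jobs"
}
# Usage, from the repository root: run_pipeline 8
```

If the dry run fails (for example, because a path in samples.tsv does not exist), the real run is skipped and the function returns a non-zero status.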
The folder hierarchy follows the one described in A Quick Guide to Organizing Computational Biology Projects:
smsk_selection
├── data: raw data, downloaded fastas, databases, etc.
├── README.md
├── Snakefile: Pipeline runner
├── results: processed data.
│   ├── busco: SCOs identified
│   ├── cdhit: clustered transcriptome
│   ├── homologs: clustered orthogroups as in Yang et al.
│   ├── orthofinder: clustered orthogroups by OrthoFinder
│   ├── selection: alignments and positive selection results
│   ├── transcriptome: links to input transcriptomes
│   ├── transdecoder: predicted CDS
│   ├── tree: ML and Bayesian species trees from 4-fold degenerate sites
│   └── trinotate: transcriptome annotation
└── src: additional source code, tarballs, snakefiles, etc.
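As a quick way to see how far a run has progressed, the following sketch reports which of the results subfolders listed above are present. The helper name check_results is hypothetical, and the folder names are simply copied from the tree. It assumes it is given the repository root as its first argument (defaulting to the current directory).

```shell
# Hypothetical helper: report which results/ subfolders from the tree exist.
check_results() {
    root="${1:-.}"
    missing=0
    for d in busco cdhit homologs orthofinder selection \
             transcriptome transdecoder tree trinotate; do
        if [ -d "$root/results/$d" ]; then
            echo "ok       results/$d"
        else
            echo "missing  results/$d"
            missing=$((missing + 1))
        fi
    done
    # Exit status is 0 only when every listed folder is present
    [ "$missing" -eq 0 ]
}
# Usage, from the repository root: check_results .
```

A folder being present only shows that the corresponding step started writing output, not that it finished successfully.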
To run this pipeline, it should only be necessary to have snakemake and conda / mamba. Together they are able to download the packages required to run each step.
In case of doubt, the Dockerfile contains the list of required packages to install.
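A minimal pre-flight check for these prerequisites might look like the sketch below. The helper names have and check_prereqs are hypothetical and not part of this repository. The check only tests whether the commands can be found on PATH, not whether their versions are suitable.

```shell
# Hypothetical pre-flight check: snakemake plus conda or mamba on PATH.
have() {
    # Succeeds when the shell can resolve the given command name
    command -v "$1" >/dev/null 2>&1
}

check_prereqs() {
    rc=0
    have snakemake || { echo "snakemake not found" >&2; rc=1; }
    # Either package manager is enough
    have conda || have mamba || { echo "neither conda nor mamba found" >&2; rc=1; }
    return "$rc"
}
# Usage: check_prereqs && echo "ready to run"
```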