Skip to content

jonasmehl1/structural_phylogenetics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A pipeline for single gene structural phylogenies

Snakemake

This repo impelements multiple phylogenetic methods based on protein structures.

Installation

Creeate the conda environment that contains all necessary dependencies

conda env create --file=workflow/envs/structural_phylogeny.yaml

Data preparation

All that is needed is a file with uniprot identifiers called identifiers.txt (example in data/input/identifiers.txt) A common workflow would be to identify homologs of a query sequence on the AlphaFoldDB with Foldseek and then download the resulting table. This table can then be converted to the right format by using

python processing/AFDB_to_uniprot.py table.csv

Structures and sequences as well as metadata will then be downloaded automatically when running the snakemake pipeline.

Usage

The normal way to run this pipeline would be to specify the name of the output directory to be created in /results and the name of the identifiers file found in data/input

snakemake --configfile config/params.yaml --config outdir=test identifiers=identifiers_test.txt -p -k -c 16

Since the pipeline can take some time it can also be launched in a non interactive session to prevent termial timeout

nohup snakemake --configfile config/params.yaml --config outdir=test identifiers=identifiers_test.txt -p -k -c 16 > .snakemake/log/snakemake_$(date +“%Y-%m-%d_%H-%M-%S”).log 2>&1 &

Phylogeny pipeline

Structures are aligned using FoldMason and then trimmed using ClipKit. The pipeline will by default run a standard iqtree ML run on the amino acid alignment, as well as a ML run on the 3di alignment using the AF substitution matrix by Garg & Hochberg, 2024. Alternatively the 3di matrix by Puente-Lelievre et al., 2024 can also be used. Both alignments will then be concatenated to run a partition scheme. If desired (needs to be enabled in the main snakefile), additionally another tree will be computed with FastME using the more conventional Intramolecular distance metric, which is implemented in T-COFFEE. This includes a bootstrapping procedure where 100 trees are created from 100 bootstrapped matrices and combined.

Main snakemake pipeline

Below are the default parameters for tree reconstruction, which can be customized in config/params.yaml

  • Alignment:
    • 3Di: foldmason easy-msa {input_structures} {output_alignment} /tmp --report-mode 1
    • aa: samee as for 3di as foldmason outputs both. Alternatively can also use mafft
    • comb: concatenate aa and 3Di

-- report-mode 1 will also create a html report that can be used to interactively visualize the structure alignment. Caution: this might take some time and resources to load if the alignment is large

  • Trimming:

    • clipkit(default): clipkit {alignment} -o {output} -m gappy -l. This is the more conservative option
    • trimal trimal -in {alignment} -out {output} -automated1. More aggressive trimming approach that is used for the FastME trees.
  • Tree inference:

    • AA: iqtree2 -s {aa.alignment} --prefix $tree_prefix -B 1000 -T {threads} --boot-trees --quiet --mem 16G --cmin 4 --cmax 12 –m MFP. Uses Modelfinder to find the best fit model
    • AF: iqtree2 -s {3di.alignment}{same as AA} –mset resources/subst_matrixes/Q_mat_AF_Garg.txt
    • FM: fastme -i {distmat} -o {output} -g 1.0 -s -n -z 5
    • 3Di: iqtree2 -s {3di.alignment} {same as AF} –mset 3DI -mdef resources/subst_matrixes/3DI.nexus
    • Part: Create partition file with best model from ML and AF. Then iqtree2 -s {combined_alignment} -p {input.part} -B 1000 -T {threads}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published