A Snakemake pipeline for creating annotated tandem repeat bed files for the human reference genome versions GRCh37, GRCh38, and CHM13, based on the tandem repeats in the Genome In A Bottle (GIAB) tandem repeat stratifications.
The pipeline is based on the work in the regions directory of the adotto repo, where the methods behind the adotto v1.0 GRCh38 tandem repeat database were developed and documented.
Briefly, the input tandem repeat bed file (the GIAB AllTandemRepeat.bed.gz in this case) is preprocessed: regions smaller than 10 bp or larger than 50 kb are filtered out, and the remaining regions are expanded by 25 bp on either side.
The regions are annotated using TandemRepeatFinder and RepeatMasker with additional annotations calculated using a custom script.
See the adotto repo for additional information and see Data Descriptor for a description of the Adotto bed file format.
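The preprocessing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function name `filter_and_expand` is hypothetical, 50 kb is taken to mean 50,000 bp, the size limits are assumed to be inclusive, and the expanded start is clamped at 0 (the source does not say how chromosome ends are handled).

```python
from typing import Iterable, Iterator, Tuple

# (chrom, start, end) using BED-style 0-based, half-open coordinates
Region = Tuple[str, int, int]

MIN_LEN = 10      # regions shorter than this are dropped
MAX_LEN = 50_000  # regions longer than this are dropped (assumes 50 kb = 50,000 bp)
FLANK = 25        # bp added on each side of every kept region


def filter_and_expand(regions: Iterable[Region]) -> Iterator[Region]:
    """Drop regions outside the size limits, then pad the rest by FLANK bp."""
    for chrom, start, end in regions:
        length = end - start
        if length < MIN_LEN or length > MAX_LEN:
            continue
        # Assumption: never let the padded start go below 0.
        yield chrom, max(0, start - FLANK), end + FLANK


example = [
    ("chr21", 100, 105),          # 5 bp: dropped
    ("chr21", 1_000, 1_040),      # 40 bp: kept and padded
    ("chr21", 10, 30),            # 20 bp: kept, start clamped to 0
    ("chr21", 200_000, 260_000),  # 60 kb: dropped
]
result = list(filter_and_expand(example))
print(result)
```

Only the two regions inside the size limits survive, each widened by 25 bp per side.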
Note that the Adotto GRCh38 bed files generated by AC English and described in the adotto repo make use of additional tandem repeat sources and annotations.
For GRCh38 the bed files listed on the Adotto repo are more complete and better annotated.
The objective of this work is to provide a streamlined method to generate comparable Adotto bed files for the three commonly used versions of the human reference genome: GRCh37, GRCh38, and CHM13.
To run the pipeline, you'll need Snakemake installed. Once installed, you can initiate the pipeline using the following command:
snakemake --use-conda -j [number of threads]
This will execute the workflow, and Snakemake will automatically handle the creation of environments using Conda for each rule that requires specific software.
The pipeline requires a configuration file named config.yaml to specify various parameters and input data. The structure of this configuration file is as follows:
references:
  GRCh37:
    REFURL: "URL_TO_GRCh37_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh37"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh37"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh37"
  GRCh38:
    REFURL: "URL_TO_GRCh38_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh38"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh38"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh38"
  CHM13:
    REFURL: "URL_TO_CHM13_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_CHM13"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_CHM13"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_CHM13"
n_splits: NUMBER_OF_SPLITS_FOR_FASTA
rm_threads: REPEAT_MASKER_THREADS
Note: Replace the placeholders (URL_TO..., MD5_CHECKSUM_FOR..., NUMBER_OF_SPLITS_FOR_FASTA) with the actual values as per your setup.
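Before starting a run, a configuration with this layout can be sanity-checked. The sketch below is illustrative only: `find_config_problems` is a hypothetical helper, the configuration is assumed to be already loaded into a Python dict (for example with a YAML parser), and `n_splits` and `rm_threads` are assumed to be top-level keys next to `references`.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL_KEYS = ("references", "n_splits", "rm_threads")
REQUIRED_REFERENCE_KEYS = ("REFURL", "REF_MD5", "GIABTRURL", "GIABTR_MD5")


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the layout looks fine."""
    problems = []
    for key in REQUIRED_TOP_LEVEL_KEYS:
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    # Each reference (e.g. GRCh38) needs all four URL/MD5 entries filled in.
    for name, entry in (config.get("references") or {}).items():
        for key in REQUIRED_REFERENCE_KEYS:
            if not (entry or {}).get(key):
                problems.append(f"{name}: missing or empty {key}")
    return problems


# Example values are placeholders, not real resources.
example_config = {
    "references": {
        "GRCh38": {
            "REFURL": "https://example.org/GRCh38.fa.gz",
            "REF_MD5": "0" * 32,
            "GIABTRURL": "https://example.org/GRCh38_AllTandemRepeats.bed.gz",
            "GIABTR_MD5": "",
        }
    },
    "n_splits": 4,
}
problems = find_config_problems(example_config)
for problem in problems:
    print(problem)
```

Returning a list of messages rather than raising on the first error lets all problems be reported in one pass.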
The config/config.yml includes the URLs and MD5s for GIAB hosted resources (excluding CHM13v2.0).
For testing purposes, .test/config.yml provides URLs and MD5s for a test dataset (GRCh38 chromosome 21 only).
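The MD5 values in these configuration files can also be checked by hand against a downloaded file. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_md5` are hypothetical, and this is not necessarily how the pipeline itself performs the check.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with an expected value such as REF_MD5."""
    return md5_of_file(path) == expected.strip().lower()


# Small demonstration on a temporary file instead of a real reference genome.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.txt"
    demo_file.write_bytes(b"hello")
    demo_ok = matches_md5(demo_file, "5d41402abc4b2a76b9719d911017c592")
    demo_bad = matches_md5(demo_file, "0" * 32)

print(demo_ok, demo_bad)
```

Reading in chunks keeps memory use flat even for multi-gigabyte reference files.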
REFURL: URL to the reference genome.
REF_MD5: MD5 checksum for the reference genome to ensure data integrity.
GIABTRURL: URL to the GIAB tandem repeat stratification data.
GIABTR_MD5: MD5 checksum for the GIAB tandem repeat stratification data.
n_splits: Specifies how many parts the reference genome should be split into during the RepeatMasker annotation phase.
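To illustrate what `n_splits` controls, the sketch below divides FASTA records into a given number of roughly equal-sized parts. It is an assumption-laden illustration: the source does not describe how the pipeline actually splits the reference, the greedy largest-first strategy is just one reasonable choice, and the function names `read_fasta_records` and `split_records` are hypothetical.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def read_fasta_records(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (name, sequence) pairs."""
    records: List[Record] = []
    name, seq_lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(seq_lines)))
            name, seq_lines = line[1:].strip(), []
        elif line.strip():
            seq_lines.append(line.strip())
    if name is not None:
        records.append((name, "".join(seq_lines)))
    return records


def split_records(records: List[Record], n_splits: int) -> List[List[Record]]:
    """Greedy split: place each record, largest first, into the smallest part."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    parts: List[List[Record]] = [[] for _ in range(n_splits)]
    for record in sorted(records, key=lambda r: len(r[1]), reverse=True):
        smallest = min(parts, key=lambda p: sum(len(seq) for _, seq in p))
        smallest.append(record)
    return parts


fasta_text = ">chr1\nACGTACGT\n>chr2\nACGTAC\n>chr3\nACGT\n>chr4\nAC\n"
parts = split_records(read_fasta_records(fasta_text), n_splits=2)
part_names = [[name for name, _ in part] for part in parts]
print(part_names)
```

Balancing the parts by total sequence length matters because the parts are presumably annotated in parallel, so the largest part bounds the overall run time.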
If you'd like to contribute to the development of this pipeline or report any issues, please open a pull request or an issue.