Tandem Repeat Annotation Pipeline

A Snakemake pipeline that generates annotated tandem repeat bed files for the human reference genomes GRCh37, GRCh38, and CHM13, based on the tandem repeats in the Genome In A Bottle (GIAB) tandem repeat stratifications.

Background

The pipeline is based on the work in the regions directory of the adotto repo, where the methods used to build the adotto v1.0 GRCh38 tandem repeat database were developed and documented. Briefly, the input tandem repeat bed file (here, the GIAB AllTandemRepeat.bed.gz) is preprocessed: regions smaller than 10 bp or larger than 50 kb are filtered out, and the remaining regions are expanded by 25 bp on either side. The regions are then annotated using TandemRepeatFinder and RepeatMasker, with additional annotations calculated using a custom script. See the adotto repo for additional information, and see the Data Descriptor for a description of the Adotto bed file format. Note that the Adotto GRCh38 bed files generated by AC English and described in the adotto repo make use of additional tandem repeat sources and annotations, so for GRCh38 the bed files listed on the Adotto repo are more complete and better annotated. The objective of this work is to provide a streamlined method for generating comparable Adotto bed files for three commonly used versions of the human reference genome: GRCh37, GRCh38, and CHM13.
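The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name preprocess_regions, the inclusive handling of the 10 bp and 50 kb boundaries, the clamping of padded coordinates to the chromosome ends, and the toy chromosome chrT are all assumptions made for the example.

```python
# Hypothetical sketch of the length filter and flank expansion; the real
# pipeline may implement this differently (for example with bedtools).
MIN_LEN = 10        # regions shorter than 10 bp are dropped
MAX_LEN = 50_000    # regions longer than 50 kb are dropped
FLANK = 25          # bases added to each side of a retained region


def preprocess_regions(regions, chrom_sizes):
    """Filter BED-style (chrom, start, end) regions by length, then pad them.

    Coordinates are 0-based and half-open, as in BED. Padded coordinates are
    clamped to [0, chromosome length], which is an assumption of this sketch.
    """
    kept = []
    for chrom, start, end in regions:
        length = end - start
        if length < MIN_LEN or length > MAX_LEN:
            continue  # outside the accepted size range
        new_start = max(0, start - FLANK)
        new_end = min(chrom_sizes[chrom], end + FLANK)
        kept.append((chrom, new_start, new_end))
    return kept


# Toy input on a made-up 100 kb chromosome.
toy_sizes = {"chrT": 100_000}
toy_regions = [
    ("chrT", 5, 12),         # 7 bp: too short, dropped
    ("chrT", 100, 160),      # 60 bp: kept and padded
    ("chrT", 1000, 61000),   # 60 kb: too long, dropped
    ("chrT", 10, 30),        # 20 bp: kept, start clamped to 0
]
print(preprocess_regions(toy_regions, toy_sizes))
```

With the toy input, only the 60 bp and 20 bp regions survive, and the second one has its start clamped at the beginning of the chromosome.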

Usage

To run the pipeline, you'll need Snakemake installed. Once installed, you can initiate the pipeline using the following command:

snakemake --use-conda -j [number of threads]

This executes the workflow. With --use-conda, Snakemake automatically creates the Conda environment for each rule that requires specific software.

Configuration

The pipeline requires a configuration file named config.yaml to specify various parameters and input data. The structure of this configuration file is as follows:

references:
  GRCh37:
    REFURL: "URL_TO_GRCh37_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh37"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh37"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh37"
  GRCh38:
    REFURL: "URL_TO_GRCh38_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh38"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh38"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh38"
  CHM13:
    REFURL: "URL_TO_CHM13_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_CHM13"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_CHM13"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_CHM13"
n_splits: NUMBER_OF_SPLITS_FOR_FASTA
rm_threads: REPEAT_MASKER_THREADS

Note: Replace the placeholders (URL_TO..., MD5_CHECKSUM_FOR..., NUMBER_OF_SPLITS_FOR_FASTA, REPEAT_MASKER_THREADS) with the actual values for your setup. The config/config.yml file includes the URLs and MD5 checksums for the GIAB-hosted resources (excluding CHM13v2.0). For testing purposes, .test/config.yml provides the URLs and MD5 checksums for a test dataset (GRCh38 chromosome 21 only).
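Before launching a run, it can help to check that no placeholder was left unfilled. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary (for example with a YAML parser). The function find_config_problems and the example URLs are hypothetical and not part of the pipeline, and treating rm_threads as a required key is an assumption based on the structure shown above.

```python
# Hypothetical helper, not part of the pipeline: reports missing keys and
# values that still look like the placeholders from the template above.
REQUIRED_TOP_KEYS = ("references", "n_splits", "rm_threads")
REQUIRED_REF_KEYS = ("REFURL", "REF_MD5", "GIABTRURL", "GIABTR_MD5")
PLACEHOLDER_PREFIXES = ("URL_TO_", "MD5_CHECKSUM_FOR_")


def find_config_problems(config):
    """Return a list of problem descriptions; an empty list means none found."""
    problems = []
    for key in REQUIRED_TOP_KEYS:
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for name, ref in config.get("references", {}).items():
        for key in REQUIRED_REF_KEYS:
            value = str(ref.get(key) or "")
            if not value:
                problems.append(f"{name}: missing {key}")
            elif value.startswith(PLACEHOLDER_PREFIXES):
                problems.append(f"{name}: {key} is still a placeholder")
    for key in ("n_splits", "rm_threads"):
        if key in config:
            value = config[key]
            if not (isinstance(value, int) and value > 0):
                problems.append(f"{key} must be a positive integer")
    return problems


# Deliberately incomplete example; the URLs are made up for illustration.
example_config = {
    "references": {
        "GRCh38": {
            "REFURL": "https://example.org/GRCh38.fa.gz",
            "REF_MD5": "MD5_CHECKSUM_FOR_GRCh38",
            "GIABTRURL": "https://example.org/GRCh38_AllTandemRepeats.bed.gz",
        }
    },
    "n_splits": 4,
}
for problem in find_config_problems(example_config):
    print(problem)
```

For the example dictionary, the check reports the missing rm_threads key, the unfilled REF_MD5 placeholder, and the missing GIABTR_MD5 entry.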

Configuration Values

REFURL: URL to the reference genome.
REF_MD5: MD5 checksum for the reference genome to ensure data integrity.
GIABTRURL: URL to the GIAB tandem repeat stratification data.
GIABTR_MD5: MD5 checksum for the GIAB tandem repeat stratification data.
n_splits: Number of parts the reference genome is split into during the RepeatMasker annotation phase.
rm_threads: Number of threads used by RepeatMasker.
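The MD5 values exist so that a downloaded file can be compared against its expected checksum. How the pipeline performs this check is not shown here; the sketch below only illustrates the general idea using Python's standard hashlib module. The function names md5_of_file and matches_md5 are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that large reference genomes do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Return True if the file's MD5 equals the expected value.

    Surrounding whitespace and letter case in the expected value are ignored.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as matches_md5("reference.fa.gz", "<value of REF_MD5>") would then indicate whether the download is intact; the file name here is only an example.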

Contributing

If you'd like to contribute to the development of this pipeline or report an issue, please open a pull request or an issue.
