
A snakemake based pipeline to build Adotto TR databases


nate-d-olson/adotto-smk


Tandem Repeat Annotation Pipeline


A Snakemake pipeline for creating annotated tandem repeats for the human reference genome versions GRCh37, GRCh38, and CHM13, based on the tandem repeats in the Genome In A Bottle (GIAB) tandem repeat stratifications.

Background

The pipeline is based on the work in the regions directory of the adotto repo, where the methods used to build the adotto v1.0 GRCh38 tandem repeat database were developed and documented. Briefly, the input tandem repeat bed file (here, the GIAB AllTandemRepeat.bed.gz) is preprocessed: regions smaller than 10 bp or larger than 50 kb are removed, and the remaining regions are expanded by 25 bp on either side. The regions are then annotated using Tandem Repeats Finder (TRF) and RepeatMasker, with additional annotations calculated by a custom script. See the adotto repo for additional information, and see the Data Descriptor for a description of the Adotto bed file format.

Note that the Adotto GRCh38 bed files generated by AC English and described in the adotto repo make use of additional tandem repeat sources and annotations, so for GRCh38 the bed files listed on the adotto repo are more complete and better annotated. The objective of this work is to provide a streamlined method for generating comparable Adotto bed files for three commonly used versions of the human reference genome: GRCh37, GRCh38, and CHM13.
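The size filter and 25 bp expansion in the preprocessing step can be illustrated with a short Python sketch. This is not the pipeline's own code. The function name `preprocess_regions`, the constants, and the example coordinates are hypothetical, and the sketch assumes BED-style 0-based, half-open coordinates with the size limits treated as inclusive (regions of exactly 10 bp or 50,000 bp are kept). Clamping the start at 0 is also an assumption, and any merging of overlapping regions or clamping at chromosome ends is left out.

```python
from typing import Iterable, Iterator, Tuple

# (chrom, start, end) using BED-style 0-based, half-open coordinates
Region = Tuple[str, int, int]

MIN_LEN = 10       # regions shorter than this are dropped
MAX_LEN = 50_000   # regions longer than this are dropped
SLOP = 25          # bases added to each side of a kept region


def preprocess_regions(regions: Iterable[Region]) -> Iterator[Region]:
    """Drop regions outside the size limits and expand the rest by SLOP."""
    for chrom, start, end in regions:
        length = end - start
        if length < MIN_LEN or length > MAX_LEN:
            continue
        # Assumption: never let the expanded start go below zero.
        yield chrom, max(0, start - SLOP), end + SLOP


example = [
    ("chr21", 100, 105),     # 5 bp: too small, dropped
    ("chr21", 10, 40),       # 30 bp: kept, start clamped to 0
    ("chr21", 1000, 61000),  # 60 kb: too large, dropped
    ("chr21", 5000, 5200),   # 200 bp: kept
]
result = list(preprocess_regions(example))
print(result)
```

For the made-up input above, only the 30 bp and 200 bp regions survive, and each is widened by 25 bp on both sides.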

Usage

To run the pipeline, you'll need Snakemake installed. Once installed, you can initiate the pipeline using the following command:

snakemake --use-conda -j [number of threads]

This will execute the workflow, and Snakemake will automatically handle the creation of environments using Conda for each rule that requires specific software.

Configuration

The pipeline requires a configuration file named config.yaml to specify various parameters and input data. The structure of this configuration file is as follows:

references:
  GRCh37:
    REFURL: "URL_TO_GRCh37_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh37"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh37"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh37"
  GRCh38:
    REFURL: "URL_TO_GRCh38_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_GRCh38"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_GRCh38"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_GRCh38"
  CHM13:
    REFURL: "URL_TO_CHM13_REFERENCE"
    REF_MD5: "MD5_CHECKSUM_FOR_CHM13"
    GIABTRURL: "URL_TO_GIAB_TR_STRATIFICATION_FOR_CHM13"
    GIABTR_MD5: "MD5_CHECKSUM_FOR_GIAB_TR_FOR_CHM13"
n_splits: NUMBER_OF_SPLITS_FOR_FASTA
rm_threads: REPEAT_MASKER_THREADS

Note: Replace the placeholders (URL_TO..., MD5_CHECKSUM_FOR..., NUMBER_OF_SPLITS_FOR_FASTA, REPEAT_MASKER_THREADS) with the actual values for your setup. The config/config.yml includes the URLs and MD5 checksums for GIAB-hosted resources (excluding CHM13v2.0). For testing purposes, .test/config.yml provides the URLs and MD5 checksums for a test dataset (GRCh38 chromosome 21 only).
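Before launching a run, it can help to check that a configuration has the shape shown above. The following is a minimal sketch, assuming the configuration has already been loaded into a Python dictionary (for example by a YAML parser). The function `validate_config` and the example values are hypothetical and are not part of the pipeline.

```python
from typing import List

# Keys expected under each entry of the 'references' section
REQUIRED_REF_KEYS = ("REFURL", "REF_MD5", "GIABTRURL", "GIABTR_MD5")


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    references = config.get("references")
    if not isinstance(references, dict) or not references:
        problems.append("missing or empty 'references' section")
        references = {}
    for name, entry in references.items():
        for key in REQUIRED_REF_KEYS:
            value = entry.get(key) if isinstance(entry, dict) else None
            if not value:
                problems.append(f"{name}: missing {key}")
    for key in ("n_splits", "rm_threads"):
        value = config.get(key)
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems


# Hypothetical example values, not real URLs or checksums
good = {
    "references": {
        "GRCh38": {
            "REFURL": "https://example.org/GRCh38.fa.gz",
            "REF_MD5": "0" * 32,
            "GIABTRURL": "https://example.org/GRCh38_TR.bed.gz",
            "GIABTR_MD5": "1" * 32,
        }
    },
    "n_splits": 4,
    "rm_threads": 2,
}
bad = {
    "references": {"CHM13": {"REFURL": "https://example.org/CHM13.fa.gz"}},
    "n_splits": 0,
}
good_problems = validate_config(good)
bad_problems = validate_config(bad)
print(good_problems)
print(bad_problems)
```

Collecting every problem into a list, rather than stopping at the first one, lets a user fix a partly filled-in configuration in one pass.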

Configuration Values

  • REFURL: URL to the reference genome.
  • REF_MD5: MD5 checksum for the reference genome to ensure data integrity.
  • GIABTRURL: URL to the GIAB tandem repeat stratification data.
  • GIABTR_MD5: MD5 checksum for the GIAB tandem repeat stratification data.
  • n_splits: Number of parts the reference genome is split into during the RepeatMasker annotation phase.
  • rm_threads: Number of threads used by RepeatMasker.
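The MD5 values above are used to confirm that a downloaded file matches the expected checksum. How the pipeline performs this check is not described here. The sketch below only illustrates the general idea using Python's standard `hashlib` module, with hypothetical helper names `md5_of_file` and `verify_md5` and a small temporary file standing in for a download.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.strip().lower()


# Demonstration on a small temporary file standing in for a download
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    matches = verify_md5(demo_path, "5d41402abc4b2a76b9719d911017c592")
    mismatch = verify_md5(demo_path, "0" * 32)

print(matches, mismatch)
```

Reading in chunks keeps memory use flat, which matters for files the size of a full reference genome.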

Contributing

If you'd like to contribute to the development of this pipeline or report an issue, please open a pull request or an issue.
