A pipeline for computing validation based on the positive control primers. It is intended to be run after the HMAS QC Pipeline.
Started by @jessicarowell Active development by @jessicarowell
This pipeline takes 4 input files. The first 3 are outputs from HMAS QC Pipeline.
- A fasta file: this file contains the high-quality consensus reads that passed filtering steps in the HMAS QC Pipeline
- A names file: this text file matches each consensus reads with all the identical reads
- A group file: this text file matches each read that passed QC to it's associated primer pair
- A primer design file: this is a fasta-formatted file with primer pair name as the header, and its associated amplicon as the sequence.
The design file contains amplicons specifically designed as positive controls for our 823 AMD primers. Those 823 primers were designed to walk a set of AMD genes of interest to enterics. For each primer pair, we design ed an amplicon of the same length and GC content using Coliphage phi-X17, Wolbachia pipientis, or Streptomyces coelicolor (depending on target GC content).
In a perfect world, we expect the primers to bind and amplify the amplicons they were designed to target. So in our final, high-quality fasta file, we hope the reads and their associated primer pair will match the desig n file. A blast search against this reference database built from the amplicons in our design file helps us evaluate how well our primers hit their expected targets.
The pipeline runs the following steps:
- It creates a blast database from the primer design file. (this is the reference database)
- It blasts the final, post-QC fasta file against this reference database. It allows a maximum of 20 hits.
- Create a SQL database from the group file {read: primer pair name}. We will search against this database.
- Get and store the number of consensus reads for each unique reads in the final, post-QC fasta file.
- Collate all of this information so we have the primer pair and number of consensus reads for each unique read in our final fasta file.
- Merge our final reads with the blast results to find out the following information:
- How many reads perfectly match with their intended primer pairs in the design file?
- How many reads don't match anything in the design file?
- How many reads match more than just their intended primer pairs?
-
Python 3 or higher. Download python here.
-
You must have pandas installed.
To be completed...
(Note: I haven't elaborated here because these instructions will change when we containerize the pipeline.)
-
Copy the Github repository to a folder
git clone https://github.com/jessicarowell/HMAS-Internal-Validation.git
-
Add
hmas_validation.py
to your $PATH
To be elaborated...
hmas_validation.py -h
hmas_validation.py -f twist.final.fasta -r amr_design_primers.fasta -g twist.final.groups -n twist.final.names -o amr2020
(Note: this section might also change depending on how we package this.)
Please feel free to fork this repo, make improvements, and share them with me.
Please also post any issues you encounter to Github and I'll be sure to look into them as soon as I can.
We plan to containerize this pipeline in the future.