vsrp

This is a bash pipeline that converts NetAffx annotation and sample files (TSV format, from either NA29 or NA32), into sorted, Gunzipped VCF files for each chromosome - the required input file format for the Michigan Imputation Server.

The proposed pipeline builds on top of the data processing pipeline as described on Michigan Imputation Server Docs

Steps taken by the merge command:

Remove probesets with unknwon rsID, chromosome or position
Compare probesets from NA29 and NA32:
- Probesets with the same ID but different rsID are removed
- For probesets with the same ID and rsID, chromosome and position from NA32 are used. If NA29 and NA32 alleles differ, this program checks if it is due to a strand reversal. If so, NA29 alleles are updated to match NA32 alleles. If not, both sets of alleles are kept.
Write the merged annotation into a TSV file with the following columns:
1. Probe Set ID
2. rsID
3. Chromosome
4. Chromosomal Position
5. NA29 A Allele
6. NA29 B Allele
7. NA32 A Allele
8. NA32 B Allele

Steps taken by the run command:

Convert sample and annotation files from TSV format to PLINK formats
Perform quality control* and convert to single chromosome VCF files using the pipeline recommended by Michigan Imputation Server.
Sort and gunzip VCF files using BCFTools

Requirements:

Python 3.5 or newer
plink and perl installed and executable added to path
HRC-1000G-check-bim.pl (v4.3.0) & the tab delimited HRC reference, or wget and about 700Mb of disk space
BCFTools installed and executable added to path

Usage:

Merge:

python NetAffx2Imputation.py merge [-h] [-o O] anno29 anno32

positional arguments:
  anno29      CSV annotation file from NA29
  anno32      CSV annotation file from NA32

optional arguments:
  -h, --help  show this help message and exit
  -o O        Name of the output merged annotation file. Default is 'merged_annotation.tsv'

Run:

python NetAffx2Imputation.py run [-h] [--hrc HRC] [-t T] [-o O] [-r] anno samples group

positional arguments:
  anno        TSV annotation file
  samples     Directory containing sample TSV files
  group       File specifying the phenotype group and sex of each sample

optional arguments:
  -h, --help  show this help message and exit
  --hrc HRC   Directory containing HRC-1000G-check-bim-v4.3.0.zip & the tab delimited HRC reference. If not included, the required files will be
              downloaded into the current directory in hrc/.
  -t T        Allele difference threshold value. Default is 0.2.
  -o O        Output directory of vcf.gz files
  -r          Remove temporary files

Sample Run:

python NetAffx2Imputation.py merge Axiom_GW_Hu_SNP.r2.na29.annot.csv Axiom_GW_Hu_SNP.na32.annot.csv -o merged_anno.tsv

python NetAffx2Imputation.py run <annotation_tsv_file> <control_sample_dir> <case_annotation_tsv_file> --hrc ./hrc/

Execution Time:

No. of Samples	Test 1 /s	Test 2 /s	Test 3 /s
40	4m25.254s	4m42.964s	4m29.313s
100	6m45.031s	6m53.969s	6m38.285s
200	12m5.040s	12m52.375s	12m39.313s

Results are from the bash time command (real time)

*Quality Control checks:

Checks that the number of variants in the PLINK files match; variants are coded as ACTG bases.
Match the position and rsID of SNPs in the PLINK files to the Haplotype Reference Consortium reference panel (v1.1 HRC.r1-1.GRCh37.wgs.mac5.sites.tab):
- For a SNP, if its rsID exists in the reference but its position is mismatched, the position of the SNP will be corrected if the rsID is unique in the list. SNPs with other mismatch cases are removed.
- Otherwise, if a SNP has a valid position in the reference but its rsID cannot be found, its rsID will be updated based on the reference.
- SNPs with other cases of mismatches/duplicated SNPs are removed.
Based on the reference, SNPs identified as indels are excluded.
Match SNP alleles with that of the reference.
- SNPs with swapped ref/alt alleles are corrected
- SNPs that are found to be mapped to the wrong strand are corrected
- SNPs with differing alleles are removed
- Palindromic SNPs with MAF > 0.4 are removed (Can be disabled)
- SNPs exhibiting MAF that differs from the reference MAF by > 0.2 are removed (Can be disabled / Threshold can be adjusted by specifying -t)

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
NetAffx2Imputation.py		NetAffx2Imputation.py
README.md		README.md
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vsrp

Steps taken by the merge command:

Steps taken by the run command:

Requirements:

Usage:

Merge:

Run:

Sample Run:

Execution Time:

*Quality Control checks:

About

Releases

Packages

Languages

bscrow/vsrp

Folders and files

Latest commit

History

Repository files navigation

vsrp

Steps taken by the merge command:

Steps taken by the run command:

Requirements:

Usage:

Merge:

Run:

Sample Run:

Execution Time:

*Quality Control checks:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages