Skip to content

mcveanlab/TreeWAS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TreeWAS

https://travis-ci.org/mcveanlab/TreeWAS.svg?branch=master https://img.shields.io/badge/License-MIT-yellow.svg

This repository contains R code to perform TreeWAS analysis. For a full description of the method see preprint here or available within this repository.

Install

  • The code is available as an R package. To install the package clone the repository and install:
R CMD INSTALL TreeWAS

or alternatively, do this from within an R session

library(devtools)
install_github("mcveanlab/TreeWAS/TreeWAS")

Input File Formats

Data for TreeWAS analysis is encoded in three files.

Sample file

Assumes two columns with header. First column has samples IDs and second column contains either a genotype (0/1/2) or a continuous variable such as a genetic risk score (GRS). Sample ID must be the first column.

ID GRS
1 2.91109682117209
2 3.58725581286855
3 4.0187625426125
4 4.25960701964853
5 3.38657230426473
6 4.27097017945458
7 3.72455637560711

Phenotype file

A file with multiple columns and a header. One row per individual. The column ID contains sample IDs and it doesn’t need to be the first column. All other columns are assumed to be phenotypes coded as 0 or 1, column names are phenotype codes which should be present in the tree file (matching column coding in the tree file).

ID R198 R104 M512 L720 M8414 L031 K802 S662 K267
1 1 1 1 0 0 0 0 0 0 
2 0 0 0 1 0 0 0 0 0 
3 0 0 0 0 1 0 0 0 0 
4 0 0 0 0 0 1 0 0 0 
5 0 0 0 0 0 0 1 1 0 
6 0 1 0 0 0 0 0 0 1 
7 0 0 0 0 0 0 0 0 0 
8 0 0 0 0 0 0 0 0 0 
9 0 0 0 0 0 0 0 0 0 

Tree file

File defining the topology of the diagnosis tree. These files can be downloaded from the UK Biobank Showcase website. For example, the tree describing the encoding of Non-cancer Illnesses (data-field 20002) is encoded using Data-Coding 6 of the UK Biobank. This file can be downloaded here. File is assumed to be tab-delimited. A function provided in the package will parse and sort the tree for TreeWAS analysis. The first few lines of the tree for Data-Coding 19 (corresponding to the ICD-10 tree) are shown below.

coding	meaning	node_id	parent_id	selectable
A00	A00 Cholera	286	23	N
A000	A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae	287	286	Y
A001	A00.1 Cholera due to Vibrio cholerae 01, biovar el tor	288	286	Y
A009	A00.9 Cholera, unspecified	289	286	Y
A01	A01 Typhoid and paratyphoid fevers	290	23	N
A010	A01.0 Typhoid fever	291	290	Y
A011	A01.1 Paratyphoid fever A	292	290	Y
A012	A01.2 Paratyphoid fever B	293	290	Y
A013	A01.3 Paratyphoid fever C	294	290	Y

Sample inclusion file

A list of sample IDs to include in the analysis can be parsed to the script with the --keep argument. We assume one sample ID per line.

Running TreeWAS

Three scripts are provided to run TreeWAS analysis with different type of genetic variation and/or genetic models.

Analysing a genetic risk score

The script grs_tree_analysis.R performs TreeWAS analysis on a GRS. The script takes the following arguments:

ArgumentDescription
sample_fileSample file. Cannot be null.
pheno_filePhenotype file. Cannot be null.
tree_fileTree file. Cannot be null.
outprefixPrefix to use for results filenames. Defaults to “out”.
thetaPrior on the mutation rate. Defaults to 1/3.
p1Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keepSample inclusion filename. Optional.
num.coresNumber of cores to use. Defaults to 1. If greater than one the parallel package will be used.
b1_max_magThe prior is symmetric around zero. This parameter controls the range of the effect size.
b1_spacThe grid size.

To do a GRS analysis on the test data, use the following command.

./scripts/grs_tree_analysis.R \
    --sample_file=example_data/sample_file_grs.txt \
    --tree_file=example_data/tree_example_ICD10_Chap_VI.txt \
    --pheno_file=example_data/phenotype_file.txt \
    --outprefix=test_grs.res \
    --num.cores=1

Case-control study

The scripts cc_snp_tree_analysis.R and cc_snp_tree_analysis_additive.R perform case-control association analysis. The scripts take the following arguments:

ArgumentDescription
sample_fileSample file. Cannot be null.
pheno_filePhenotype file. Cannot be null.
tree_fileTree file. Cannot be null.
outprefixPrefix to use for results filenames. Defaults to “out”.
thetaPrior on the mutation rate. Defaults to 1/3.
p1Prior on the proportion of active nodes in the tree. Defaults to 0.001.
keepSample inclusion filename. Optional.
num.coresNumber of cores to use. Defaults to 1. If greater than one the parallel package will be used.
b{1,2}_max_magThe prior is symmetric around zero. This parameter controls the range of the effect sizes (b1 for the het genotype and b2 for the hom).
b{1,2}_spacThe grid size.

To Run the analysis with the test data fitting an additive model do:

./scripts/cc_snp_tree_analysis_additive.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen.res' \
    --b1_max_mag=2 \
    --b1_spac=0.02 \
    --num.cores=1

or with a full genetic model:

./scripts/cc_snp_tree_analysis.R \
    --sample_file='example_data/sample_file_gen.txt' \
    --tree_file='example_data/tree_example_ICD10_Chap_VI.txt' \
    --pheno_file='example_data/phenotype_file.txt' \
    --outprefix='test_gen2.res' \
    --theta=0.33333 \
    --p1=0.001 \
    --b1_max_mag=3 \
    --b2_max_mag=3 \
    --b1_spac=0.02 \
    --b2_spac=0.02 \
    --num.cores=1

Citation

If you use TreeWAS in your work, please cite us:

Cortes A., et al. (2017) Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. bioRxiv 105122. doi: https://doi.org/10.1101/105122