Notes and code for running UK Biobank GWAS at the MRC IEU
Please note that the pipeline is built on University of Bristol infrastructure, and this documentation is for internal use only.
External researchers: please refer to "UK Biobank Genetic Data: MRC-IEU Quality Control, version 2" if you are interested in building the pipeline locally.
- Ask the IEU data manager to set up directories and permissions for the GWAS pipeline, if not already done
- Create input files in your RDSF input directory
- Wait for files to be copied over to BC4
- On BC4, clone this repo and get .env
- Run the GWAS submission script
- A job is submitted to the queue that QCs the files, and then creates a new submission job for the GWAS
- Wait for the GWAS to complete; outputs are written to the BC4 output directory (since May 2023 they are no longer synced back to RDSF automatically)
RDSF (backed-up)
- Input:
/projects/MRC-IEU/research/data/ukbiobank/software/gwas_pipeline/dev/release_candidate/data/phenotypes/<your_username>/input
- Output (deprecated):
/projects/MRC-IEU/research/data/ukbiobank/software/gwas_pipeline/dev/release_candidate/data/phenotypes/<your_username>/output
BC4 (not backed-up)
- Input (read-only):
/mnt/storage/private/mrcieu/research/UKBIOBANK_GWAS_Pipeline/data/phenotypes/<your_username>/input
- Output:
/mnt/storage/private/mrcieu/research/UKBIOBANK_GWAS_Pipeline/data/phenotypes/<your_username>/output
Create jobs.csv in the RDSF input directory, containing information on the GWAS jobs:
- all column names must be present
- if a column has no value, provide an empty entry, e.g. ,,
- for multiple covariates, separate them using ;
name,application_id,pheno_file,pheno_col,covar_file,covar_col,qcovar_col,method
test,123,test.txt,test_name,bolt_covariates.txt,sex;chip,age,bolt
test2,123,test.txt,test_name,bolt_covariates.txt,sex;chip,age,bolt
- Each GWAS job is first checked to make sure that both the phenotype and covariate files exist, are in the correct format, and contain the specified columns.
- If all checks pass, a submission script is created and run as a new Slurm job
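The format rules above can be pre-checked locally before waiting for the QC job. The following is a minimal sketch, not the pipeline's own check: `check_jobs_csv` is a hypothetical helper name, and it assumes the columns must appear in the order shown in the example header.

```shell
#!/usr/bin/env bash
# check_jobs_csv: hypothetical local pre-check, not the pipeline's own QC step.
check_jobs_csv() {
    local jobs_file="$1"
    local expected="name,application_id,pheno_file,pheno_col,covar_file,covar_col,qcovar_col,method"
    local header

    # The header must list every documented column (assumed: in this order).
    header="$(head -n 1 "$jobs_file" | tr -d '\r')"
    if [ "$header" != "$expected" ]; then
        echo "unexpected header: $header" >&2
        return 1
    fi

    # Each non-empty data row needs 8 comma-separated fields; empty entries
    # such as ",," still count as fields. Rows are reported 0-based.
    awk -F, 'NR > 1 && NF > 0 && NF != 8 {
                 printf "row %d has %d fields, expected 8\n", NR - 2, NF
                 bad = 1
             }
             END { exit bad }' "$jobs_file" >&2
}
```

For example, `check_jobs_csv jobs.csv && echo "jobs.csv looks OK"` prints the message only when the header and every row pass.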
Create phenotype and covariate files, and place them in RDSF input directory as before.
- see https://github.com/MRCIEU/BiobankPhenotypes/wiki#phenotype-files for details
The input files will be synced to the BC4 input directory.
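To see whether the sync has finished, one option is to check on BC4 that every file named in jobs.csv has arrived. This is a rough sketch under assumptions: `check_inputs_present` is a hypothetical helper, and it assumes the pheno_file and covar_file entries are file names relative to the input directory.

```shell
#!/usr/bin/env bash
# check_inputs_present: hypothetical helper, not part of the pipeline.
check_inputs_present() {
    local jobs_file="$1"
    local input_dir="$2"
    local missing

    # Columns 3 and 5 are pheno_file and covar_file. Skip the header and
    # empty entries, then list every referenced file that is not there yet.
    missing="$(awk -F, 'NR > 1 { if ($3 != "") print $3; if ($5 != "") print $5 }' "$jobs_file" |
        sort -u |
        while IFS= read -r f; do
            [ -f "$input_dir/$f" ] || echo "$f"
        done)"

    if [ -n "$missing" ]; then
        printf 'not yet in %s:\n%s\n' "$input_dir" "$missing" >&2
        return 1
    fi
}
```

On BC4 this could be called as `check_inputs_present "$INPUT/jobs.csv" "$INPUT"`, where `INPUT` is set to your own BC4 input directory listed above.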
- Set up GitHub SSH keys
- Clone repo to home or work directory on BC4
git clone git@github.com:MRCIEU/UKBiobankGWAS.git
- Move into the directory
cd UKBiobankGWAS
- Copy the .env file to this repository
cp /mnt/storage/private/mrcieu/research/UKBIOBANK_GWAS_Pipeline/scripts/.env ./
Run from within the repository
sbatch scripts/ukb_gwas.sh
- by default this will run the first row in jobs.csv
- a specific row can be selected using 0-based indexing, so the third row is index 2, e.g.
sbatch scripts/ukb_gwas.sh 2
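Because the header occupies the first line of jobs.csv, the 0-based row index is the file line number minus 2. A small sketch that looks up the index from the job name instead of counting by hand; `job_row_index` is a hypothetical helper and not part of the repository.

```shell
#!/usr/bin/env bash
# job_row_index: hypothetical helper that prints the 0-based row index of
# the first job whose name column matches, or fails if there is none.
job_row_index() {
    # Line 1 is the header, so the job on line N has row index N - 2.
    awk -F, -v name="$1" '
        NR > 1 && $1 == name { print NR - 2; found = 1; exit }
        END { exit !found }' "$2"
}
```

With the example jobs.csv above, `job_row_index test2 jobs.csv` prints 1, so the job could be submitted with `sbatch scripts/ukb_gwas.sh "$(job_row_index test2 jobs.csv)"`.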
To submit several rows at once (here rows 0 and 1), run from within this repository
for i in {0..1}; do echo $i; sbatch scripts/ukb_gwas.sh $i; done
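The upper bound of the loop can also be derived from jobs.csv rather than typed in. The sketch below is illustrative only: `last_row_index`, `submit_all_rows`, and the `SUBMIT` variable are hypothetical names, and setting `SUBMIT=echo` gives a dry run that prints the arguments instead of calling sbatch.

```shell
#!/usr/bin/env bash
# last_row_index: hypothetical helper printing the index of the last job row.
last_row_index() {
    # Count non-empty lines, drop the header, convert to a 0-based index.
    awk 'NF > 0 { n++ } END { print n - 2 }' "$1"
}

# submit_all_rows: hypothetical helper; run it from within the repository.
submit_all_rows() {
    local jobs_file="$1"
    local last
    local i=0
    last="$(last_row_index "$jobs_file")"
    while [ "$i" -le "$last" ]; do
        # SUBMIT defaults to sbatch; SUBMIT=echo only prints the arguments.
        ${SUBMIT:-sbatch} scripts/ukb_gwas.sh "$i"
        i=$((i + 1))
    done
}
```

Running `SUBMIT=echo submit_all_rows /path/to/jobs.csv` first shows which row indices would be submitted before anything is sent to the queue.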
Summary files can be generated and then parsed to create counts:
sbatch UKBiobankGWAS/scripts/summary.sh
python UKBiobankGWAS/scripts/summary_parser.py
The outputs will be saved in the BC4 output directory.
Please note that since May 2023 the outputs are no longer synced back automatically to the RDSF output directory because of a storage shortage. You may still copy them manually and use this RDSF location if necessary. Existing outputs on RDSF will be kept until further notice.
- add arguments to allow running only the QC step
- add plink