Install Python:

```bash
sudo apt update -y
sudo apt install python3.7
```

- Create a Python virtual environment with `virtualenv` and install the pip dependencies:

```bash
mkdir -p $HOME/envs                                    # make envs directory; skip if it already exists
virtualenv -p /usr/bin/python3.7 $HOME/envs/somalogic  # create virtual environment with Python 3.7
source $HOME/envs/somalogic/bin/activate
cd $HOME/PycharmProjects/somalogic
pip install -r requirements.txt
```
- Set up the directory structure:

```bash
mkdir -p somalogic   # -p: no error if the directory already exists
cd somalogic
mkdir -p src data doc results ext
```
- Download the raw SomaLogic zip file from Hydra at `~/projects/richards/restricted/bqc19/data/somalogic/20201120/McGill-Richards-C-19-SomaScan-Data.zip` to your own directory:

```bash
# Download to local computer. While on the local machine, run the following:
cd data   # switch to the data folder where the file should be downloaded
scp -r [email protected]:~/projects/richards/restricted/bqc19/data/somalogic/20201120/McGill-Richards-C-19-SomaScan-Data.zip /mnt/c/Users/user/PyCharmProjects/somalogic/data
unzip McGill-Richards-C-19-SomaScan-Data.zip   # unzip the file
rm McGill-Richards-C-19-SomaScan-Data.zip      # remove the zip file
```
- Unzip all subdirectories:

```bash
cd 'McGill-Richards-C-19 - SomaScan Data'
unzip MCG-200150.zip -d ./MCG-200150   # unzip the file to the specified location
unzip SomaDataIO_3_1_0_and_pdf.zip -d ./SomaDataIO_3_1_0_and_pdf
rm MCG-200150.zip
rm SomaDataIO_3_1_0_and_pdf.zip
```
- Unzip the `tar.gz` file in `.../SomaDataIO_3_1_0_and_pdf`. This will produce a folder `SomaDataIO`:

```bash
cd SomaDataIO_3_1_0_and_pdf
tar -xf SomaDataIO_3.1.0.tar.gz   # -x: extract, -f: archive filename
```
- Follow the installation instructions in the `./SomaDataIO/README.md` file to install the SomaLogic software and its R dependencies.
- Read the relevant PDF files to become familiar with the SomaLogic data:

| Filename | Directory location |
| --- | --- |
| SSM-00060 - Rev 1.0 - Data Standardization and File Specification Technical Note.pdf | McGill-Richards-C-19 - SomaScan Data |
| SS-200150_SQS.pdf | MCG-200150 |
| SomaDataIO_3.1.0.pdf | SomaDataIO_3_1_0_and_pdf |
- Note: SomaLogic provides us with two `.adat` files containing the protein levels. We are working with the SomaLogic Normalized protein dataset.
  - "Normalized" dataset: `SS-200150_v4_ACDPlasma.hybNorm.medNormInt.plateScale.calibrate.anmlQC.qcCheck.medNormRefSMP.adat`
  - "Unnormalized" dataset: `SS-200150_v4_ACDPlasma.hybNorm.medNormInt.plateScale.calibrate.anmlQC.qcCheck.adat`
Here, our goal is to run logistic regression in R on the Hydra cluster. The output of this analysis will be an `.xlsx` file of association results, with proteins sorted by increasing p-value, together with odds ratios and confidence intervals for each protein. We will also generate QQ plots of the p-values.
- Create your infectious and non-infectious `.csv` datasets by merging the sample data with the SomaLogic protein levels.
- To form one dataset, use `merge_dataset.R` with the following arguments:
  i. args[1] - whether to use the normalized or unnormalized SomaLogic dataset
  ii. args[2] - whether to natural-log transform the proteins
  iii. args[3] - whether to standardize the proteins to mean 0 and standard deviation 1
  iv. args[4] - whether to remove outliers (protein values more than 3 standard deviations from the mean)

```bash
Rscript merge_dataset.R {normalized/unnormalized} {TRUE/FALSE} {TRUE/FALSE} {TRUE/FALSE}
# To use the normalized SomaLogic dataset with proteins natural-log transformed,
# unstandardized, and with no outliers removed:
Rscript merge_dataset.R normalized TRUE FALSE FALSE
```

This automatically creates both the infectious and non-infectious datasets.
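The preprocessing options in args[2]-args[4] amount to the transformations sketched below in Python. This is an illustrative sketch only (the actual work is done in `merge_dataset.R`, and the exact order of operations there may differ):

```python
import numpy as np

def preprocess(x, nat_log_transf=True, standardize=False, remove_outliers=False):
    """Sketch of the merge_dataset.R options applied to one protein's values.

    x: 1-D array of raw protein levels for a single protein.
    """
    x = np.asarray(x, dtype=float)
    if nat_log_transf:                     # args[2]: natural-log transform
        x = np.log(x)
    z = (x - x.mean()) / x.std()           # z-scores for outlier detection
    if remove_outliers:                    # args[4]: drop values beyond 3 SD
        x = x[np.abs(z) <= 3]
    if standardize:                        # args[3]: rescale to mean 0, SD 1
        x = (x - x.mean()) / x.std()
    return x
```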
- To form all possible datasets, run:

```bash
bash merge_dataset.sh
```

This will create 8 `.csv` datasets (4 normalized, 4 unnormalized) for each of the infectious and non-infectious data, for a total of 16 datasets stored in `./results/datasets/`.
- Next, set up the directory for saving the results of the association analysis:

```bash
cd results
mkdir -p all_proteins
cd all_proteins
mkdir -p age+sex+SampleGroup+ProcessTime+protlevel
cd age+sex+SampleGroup+ProcessTime+protlevel
mkdir -p pvalues qqplots   # for storing p-value text files and QQ plot images, respectively
```
- Run the association analyses for each dataset and each outcome. This will generate two files:
  - `somalogic/results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/_{infe/non_infe}_{A2/A3/B2/C1}_LR_age+sex+SampleGroup+ProcessTime+protlevel_Analysis=all_proteins.xlsx`, which contains the odds ratios, confidence intervals, and p-values.
  - `somalogic/results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_{infe/non_infe}_{A2/A3/B2/C1}_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt`, which takes the second-to-last column of the `.xlsx` file and stores it in a text file for later use in generating the QQ plots.

```bash
Rscript associations_analysis.R {infe/non_infe} {A2/A3/B2/C1}
# example run for the infectious dataset and the A2 outcome:
Rscript associations_analysis.R infe A2
```
- To launch jobs on the Hydra cluster:

```bash
qsub -v "DATA=infe,OUTCOME=A2" launch_associations.sh   # regression with infectious data on the A2 outcome
```
- In the `ext` directory, clone the qq.plink repo for plotting QQ plots:

```bash
cd ext
git clone https://github.com/vforget/qq.plink
```
- Generate QQ plots with qq.plink:

```bash
Rscript ./ext/qq.plink/qq.plink.R $pvalue.txt$ "some title"
# example for the p-value results from the infe dataset with the A2 outcome:
Rscript ./ext/qq.plink/qq.plink.R ./results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_infe_A2_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt "infe A2 age+sex+SampleGroup+ProcessTime+protlevel"
```

Here the first argument `$pvalue.txt$` should be replaced with the saved p-value text file, and the second argument `"some title"` is the title for the plot, which should be surrounded in double quotes. The output will be `./results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_infe_A2_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt.qq.png`.
- To run all analyses at once, launch `plot_qq.sh`. This will generate QQ plots for all combinations of `{infe/non_infe} {A2/A3/B2/C1}`:

```bash
bash plot_qq.sh
```

- Finally, move all QQ plots to the `qqplots` folder:

```bash
cd results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues
mv *.qq.png ../qqplots
```
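For intuition, a QQ plot compares the observed p-values against the p-values expected under the null (a uniform distribution), on the -log10 scale. The helper below sketches that computation in Python; it is illustrative only (the actual plotting is done by qq.plink in R), and assumes one p-value per entry:

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs for a QQ plot."""
    p = np.sort(np.asarray(pvalues, dtype=float))        # ascending p-values
    n = len(p)
    ranks = np.arange(1, n + 1)
    expected = -np.log10((ranks - 0.5) / n)              # uniform-null quantiles
    observed = -np.log10(p)
    return expected, observed
```

Points lying above the diagonal indicate p-values smaller than expected under the null, i.e. an excess of associations.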
- Our original list contains 5284 SomaLogic SOMAmers. For this analysis, we remove `NoneX.*`, `NonHuman`, `Spuriomer`, `HybControlElution`, `NonBiotin`, and `NonCleavable` from the list, leaving a total of 4984 proteins. This text file of proteins is stored as `somalogic/data/Somalogic_list_QC1.txt`.
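The QC filter above can be sketched as follows. This is a minimal illustration, assuming the list holds one SOMAmer name per line and that the excluded entries can be matched by these name patterns (the example SOMAmer names are hypothetical):

```python
import re

# Patterns for SOMAmer entries to exclude; NoneX.* is treated as a pattern,
# the rest as literal category prefixes. How the excluded entries actually
# appear in Somalogic_list_QC1.txt is an assumption here.
EXCLUDE = re.compile(
    r"^(NoneX\..*|NonHuman|Spuriomer|HybControlElution|NonBiotin|NonCleavable)"
)

def qc_filter(somamers):
    """Keep only SOMAmers that do not match any excluded pattern."""
    return [s for s in somamers if not EXCLUDE.match(s)]

somamers = ["CRP.4337.49", "NonHuman.1234.5", "Spuriomer.2171.12", "IL6.4673.13"]
print(qc_filter(somamers))  # -> ['CRP.4337.49', 'IL6.4673.13']
```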
- Run the models. This will perform a hyperparameter search using stratified 5-fold cross-validation, use the best hyperparameters to train on the entire training set, and save the final model.

```bash
python models.py --soma_data {normalized/unnormalized} --nat_log_transf {True/False} --standardize {True/False} --data {infe/non_infe} --outcome {A2/A3/B2/C1} --model_type {lasso/elasticnet} --params_search {True/False}
# example run:
python models.py --soma_data normalized --nat_log_transf True --standardize True --data infe --outcome A2 --model_type lasso --params_search True
```

Note: after running once with `--params_search True`, the model parameters are saved as a `.pkl` file, such as `somalogic/results/models/lasso-soma_data=normalized-nat_log_transf=False-standardize=False_infe_A2_results.pkl` if the above command were run. Since the parameters are already saved, subsequent runs should set `--params_search False`, and `models.py` will load this `.pkl` file directly.
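The hyperparameter search in `models.py` follows a standard scikit-learn pattern, sketched below on toy data. This is an illustration, not the script itself; in particular, treating "lasso" as L1-penalised logistic regression, the `C` grid, and the AUC scoring are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Toy stand-in for the protein matrix X and binary outcome y.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Stratified 5-fold CV over a grid of inverse regularisation strengths C.
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)  # refit=True (the default) retrains the best model on all data
best_model = search.best_estimator_
```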
- Merge your SomaLogic Normalized dataset with the patient information and name this file `test.csv`. Do not perform any preprocessing on this dataset, i.e. no log transformation, scaling, or outlier removal of any sort. All protein levels should be the raw values from the original SomaLogic Normalized dataset.
- Define the A2 and A3 outcomes within this dataset.
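Attaching the outcome columns might look like the sketch below. The clinical field names and case definitions here are purely hypothetical placeholders; replace them with the study's actual A2/A3 definitions so the test set matches the outcomes used in training:

```python
import pandas as pd

def add_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Attach binary A2/A3 outcome columns to the merged test dataset.

    'case_status_A2' and 'case_status_A3' are hypothetical field names
    standing in for the real clinical variables.
    """
    df = df.copy()
    df["A2"] = (df["case_status_A2"] == 1).astype(int)  # hypothetical field
    df["A3"] = (df["case_status_A3"] == 1).astype(int)  # hypothetical field
    return df
```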
- Store your SomaLogic Normalized dataset under `somalogic/results/datasets/test`:

```bash
cd somalogic
mkdir -p results/datasets/test   # -p creates intermediate directories as needed
cd results/datasets/test         # store the test.csv dataset in this directory
```
- Test the model on the A2 and A3 outcomes:

```bash
python test_models.py --soma_data normalized --nat_log_transf True --standardize True --data infe --outcome {A2/A3} --model_type lasso
```
File: protein_comparison.ipynb
- Compares the proteins in the training and testing sets
- Inspects the coefficients of the trained models