Install Python:

```bash
sudo apt update -y
sudo apt install python3.7
```

- Create a Python virtual environment with `virtualenv` and install the pip dependencies:

```bash
mkdir -p $HOME/envs                                    # make envs directory; skip if it already exists
virtualenv -p /usr/bin/python3.7 $HOME/envs/somalogic  # create virtual environment with Python 3.7
source $HOME/envs/somalogic/bin/activate
cd $HOME/PycharmProjects/somalogic
pip install -r requirements.txt
```
- Set up the directory structure:

```bash
mkdir -p somalogic   # -p: no error if the directory already exists
cd somalogic
mkdir -p src data doc results ext
```
- Download the raw SomaLogic zip file from Hydra at `~/projects/richards/restricted/bqc19/data/somalogic/20201120/McGill-Richards-C-19-SomaScan-Data.zip` to your own directory:

```bash
# Download to local computer. While on the local machine, run the following:
cd data   # switch to the data folder where the file should be downloaded
scp -r [email protected]:~/projects/richards/restricted/bqc19/data/somalogic/20201120/McGill-Richards-C-19-SomaScan-Data.zip /mnt/c/Users/user/PyCharmProjects/somalogic/data
unzip McGill-Richards-C-19-SomaScan-Data.zip   # unzip the file
rm McGill-Richards-C-19-SomaScan-Data.zip      # remove the zip file
```
- Unzip all subdirectories:

```bash
cd 'McGill-Richards-C-19 - SomaScan Data'
unzip MCG-200150.zip -d ./MCG-200150   # unzip the file to the specified location
unzip SomaDataIO_3_1_0_and_pdf.zip -d ./SomaDataIO_3_1_0_and_pdf
rm MCG-200150.zip
rm SomaDataIO_3_1_0_and_pdf.zip
```
- Unzip the `tar.gz` file in `.../SomaDataIO_3_1_0_and_pdf`. This will produce a folder `SomaDataIO`:

```bash
cd SomaDataIO_3_1_0_and_pdf
tar -xf SomaDataIO_3.1.0.tar.gz   # -x: extract, -f: archive filename
```
- Follow the installation instructions in the `./SomaDataIO/README.md` file to install the SomaLogic software and its R dependencies.
- Read the relevant PDF files to become familiar with the SomaLogic data:

| Filename | Directory location |
| --- | --- |
| SSM-00060 - Rev 1.0 - Data Standardization and File Specification Technical Note.pdf | McGill-Richards-C-19 - SomaScan Data |
| SS-200150_SQS.pdf | MCG-200150 |
| SomaDataIO_3.1.0.pdf | SomaDataIO_3_1_0_and_pdf |
- Note: SomaLogic provides us with two `.adat` files containing the protein levels. We are working with the SomaLogic Normalized protein dataset.
  - "Normalized" dataset: `SS-200150_v4_ACDPlasma.hybNorm.medNormInt.plateScale.calibrate.anmlQC.qcCheck.medNormRefSMP.adat`
  - "Unnormalized" dataset: `SS-200150_v4_ACDPlasma.hybNorm.medNormInt.plateScale.calibrate.anmlQC.qcCheck.adat`
Here, our goal is to run logistic regression in R on the Hydra cluster. The output of this analysis will be an `.xlsx` file of association results, with proteins sorted by increasing p-value, together with odds ratios and confidence intervals for each protein. We will also generate QQ plots of the p-values.
- Create your infectious and non-infectious `.csv` datasets by merging the sample data with the SomaLogic protein levels.
- To form one dataset, use `merge_dataset.R` with the following arguments:
  i. args[1] - whether to use the normalized or unnormalized SomaLogic dataset
  ii. args[2] - whether to natural-log transform the proteins
  iii. args[3] - whether to standardize the proteins to mean 0 and standard deviation 1
  iv. args[4] - whether to remove outliers (protein values more than 3 standard deviations from the mean)

```bash
Rscript merge_dataset.R {normalized/unnormalized} {TRUE/FALSE} {TRUE/FALSE} {TRUE/FALSE}
# To use the normalized SomaLogic dataset with proteins natural-log transformed,
# unstandardized, and with no outliers removed:
Rscript merge_dataset.R normalized TRUE FALSE FALSE
```

This automatically creates both the infectious and non-infectious datasets.
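The preprocessing options in args[2]-args[4] amount to the transformations sketched below in Python. This is an illustrative sketch only (the actual work is done in `merge_dataset.R`, and the exact order of operations there may differ):

```python
import numpy as np

def preprocess(x, nat_log_transf=True, standardize=False, remove_outliers=False):
    """Sketch of the merge_dataset.R options applied to one protein's values.

    x: 1-D array of raw protein levels for a single protein.
    """
    x = np.asarray(x, dtype=float)
    if nat_log_transf:                     # args[2]: natural-log transform
        x = np.log(x)
    z = (x - x.mean()) / x.std()           # z-scores for outlier detection
    if remove_outliers:                    # args[4]: drop values beyond 3 SD
        x = x[np.abs(z) <= 3]
    if standardize:                        # args[3]: rescale to mean 0, SD 1
        x = (x - x.mean()) / x.std()
    return x
```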
- To form all possible datasets, run:

```bash
bash merge_dataset.sh
```

This will create 8 `.csv` datasets (4 normalized, 4 unnormalized) for each of the infectious and non-infectious data, for a total of 16 datasets stored in `./results/datasets/`.
- Next, set up the directory for saving the results of the association analysis:

```bash
cd results
mkdir -p all_proteins
cd all_proteins
mkdir -p age+sex+SampleGroup+ProcessTime+protlevel
cd age+sex+SampleGroup+ProcessTime+protlevel
mkdir -p pvalues qqplots   # for storing p-value text files and QQ plot images, respectively
```
- Run the association analyses for each dataset and each outcome. This will generate two files:
  - `somalogic/results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/_{infe/non_infe}_{A2/A3/B2/C1}_LR_age+sex+SampleGroup+ProcessTime+protlevel_Analysis=all_proteins.xlsx`, which contains the odds ratios, confidence intervals, and p-values.
  - `somalogic/results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_{infe/non_infe}_{A2/A3/B2/C1}_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt`, which takes the second-to-last column of the `.xlsx` file and stores it in a text file for later use in generating the QQ plots.

```bash
Rscript associations_analysis.R {infe/non_infe} {A2/A3/B2/C1}
# example run for the infectious dataset and the A2 outcome:
Rscript associations_analysis.R infe A2
```
- To launch jobs on the Hydra cluster:

```bash
qsub -v "DATA=infe,OUTCOME=A2" launch_associations.sh   # regression with infectious data on the A2 outcome
```
- In the `ext` directory, clone the qq.plink repo for plotting QQ plots:

```bash
cd ext
git clone https://github.com/vforget/qq.plink
```
- Generate QQ plots with qq.plink:

```bash
Rscript ./ext/qq.plink/qq.plink.R $pvalue.txt$ "some title"
# example for the p-value results from the infe dataset with the A2 outcome:
Rscript ./ext/qq.plink/qq.plink.R ./results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_infe_A2_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt "infe A2 age+sex+SampleGroup+ProcessTime+protlevel"
```

Here the first argument `$pvalue.txt$` should be replaced with the saved p-value text file, and the second argument `"some title"` is the title for the plot, which should be surrounded in double quotes. The output will be `./results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues/_infe_A2_LR_age+sex+SampleGroup+ProcessTime+protlevel_pvalues.txt.qq.png`.
- To run all analyses at once, launch `plot_qq.sh`. This will generate QQ plots for all combinations of `{infe/non_infe} {A2/A3/B2/C1}`:

```bash
bash plot_qq.sh
```

- Finally, move all QQ plots to the `qqplots` folder:

```bash
cd results/all_proteins/age+sex+SampleGroup+ProcessTime+protlevel/pvalues
mv *.qq.png ../qqplots
```
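For intuition, a QQ plot compares the observed p-values against the p-values expected under the null (a uniform distribution), on the -log10 scale. The helper below sketches that computation in Python; it is illustrative only (the actual plotting is done by qq.plink in R), and assumes one p-value per entry:

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs for a QQ plot."""
    p = np.sort(np.asarray(pvalues, dtype=float))        # ascending p-values
    n = len(p)
    ranks = np.arange(1, n + 1)
    expected = -np.log10((ranks - 0.5) / n)              # uniform-null quantiles
    observed = -np.log10(p)
    return expected, observed
```

Points lying above the diagonal indicate p-values smaller than expected under the null, i.e. an excess of associations.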
- Our original list contains 5284 SomaLogic SOMAmers. For this analysis, we remove `NoneX.*`, `NonHuman`, `Spuriomer`, `HybControlElution`, `NonBiotin`, and `NonCleavable` from the list, leaving a total of 4984 proteins. This text file of proteins is stored as `somalogic/data/Somalogic_list_QC1.txt`.
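The QC filter above can be sketched as follows. This is a minimal illustration, assuming the list holds one SOMAmer name per line and that the excluded entries can be matched by these name patterns (the example SOMAmer names are hypothetical):

```python
import re

# Patterns for SOMAmer entries to exclude; NoneX.* is treated as a pattern,
# the rest as literal category prefixes. How the excluded entries actually
# appear in Somalogic_list_QC1.txt is an assumption here.
EXCLUDE = re.compile(
    r"^(NoneX\..*|NonHuman|Spuriomer|HybControlElution|NonBiotin|NonCleavable)"
)

def qc_filter(somamers):
    """Keep only SOMAmers that do not match any excluded pattern."""
    return [s for s in somamers if not EXCLUDE.match(s)]

somamers = ["CRP.4337.49", "NonHuman.1234.5", "Spuriomer.2171.12", "IL6.4673.13"]
print(qc_filter(somamers))  # -> ['CRP.4337.49', 'IL6.4673.13']
```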
- Run the models. This will perform a hyperparameter search using stratified 5-fold cross-validation, use the best hyperparameters to train on the entire training set, and save the final model.

```bash
python models.py --soma_data {normalized/unnormalized} --nat_log_transf {True/False} --standardize {True/False} --data {infe/non_infe} --outcome {A2/A3/B2/C1} --model_type {lasso/elasticnet} --params_search {True/False}
# example run:
python models.py --soma_data normalized --nat_log_transf True --standardize True --data infe --outcome A2 --model_type lasso --params_search True
```

Note: after running once with `--params_search True`, the model parameters are saved as a `.pkl` file, such as `somalogic/results/models/lasso-soma_data=normalized-nat_log_transf=False-standardize=False_infe_A2_results.pkl` if the above command were run. Since the parameters are already saved, subsequent runs should set `--params_search False`, and `models.py` will load this `.pkl` file directly.
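The hyperparameter search in `models.py` follows a standard scikit-learn pattern, sketched below on toy data. This is an illustration, not the script itself; in particular, treating "lasso" as L1-penalised logistic regression, the `C` grid, and the AUC scoring are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Toy stand-in for the protein matrix X and binary outcome y.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Stratified 5-fold CV over a grid of inverse regularisation strengths C.
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)  # refit=True (the default) retrains the best model on all data
best_model = search.best_estimator_
```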
- Merge your SomaLogic Normalized dataset with the patient information and name this file `test.csv`. Do not perform any preprocessing on this dataset, i.e. no log transformation, scaling, or outlier removal of any sort. All protein levels should be the raw values from the original SomaLogic Normalized dataset.
- Define the A2 and A3 outcomes within this dataset.
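Attaching the outcome columns might look like the sketch below. The clinical field names and case definitions here are purely hypothetical placeholders; replace them with the study's actual A2/A3 definitions so the test set matches the outcomes used in training:

```python
import pandas as pd

def add_outcomes(df: pd.DataFrame) -> pd.DataFrame:
    """Attach binary A2/A3 outcome columns to the merged test dataset.

    'case_status_A2' and 'case_status_A3' are hypothetical field names
    standing in for the real clinical variables.
    """
    df = df.copy()
    df["A2"] = (df["case_status_A2"] == 1).astype(int)  # hypothetical field
    df["A3"] = (df["case_status_A3"] == 1).astype(int)  # hypothetical field
    return df
```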
- Store your SomaLogic Normalized dataset under `somalogic/results/datasets/test`:

```bash
cd somalogic
mkdir -p results/datasets/test   # -p creates intermediate directories as needed
cd results/datasets/test         # store the test.csv dataset in this directory
```
- Test the model on the A2 and A3 outcomes:

```bash
python test_models.py --soma_data normalized --nat_log_transf True --standardize True --data infe --outcome {A2/A3} --model_type lasso
```
File: protein_comparison.ipynb
- Compares the proteins in the training and testing sets
- Inspects the coefficients of the trained models