This repository contains an implementation of the survival models used by the CAMP team for the Prostate Cancer DREAM Challenge.
‼️ This repository is not actively maintained; please use sebp/scikit-survival instead ‼️
All code has only been tested on Linux-based operating systems; we therefore cannot guarantee that it will run on other platforms. The following instructions apply to Linux-based operating systems only.
- Python 3.3 or later
- IPython and IPython notebook 3.1 or later
- numexpr
- numpy 1.9 or later
- pandas 0.15.2 (patched, see below)
- scikit-learn 0.16.1
- scipy 0.15 or later
- six
- C/C++ compiler
- rpy2 2.6.0
- R 3.2 with the following packages installed:
  - randomForestSRC
  - mboost
  - timeROC
Recent versions of the pandas Python package changed how categorical variables are handled; consequently, this code is known to work only with a patched version of pandas 0.15.2. The following two patches need to be applied to pandas 0.15.2 (the sketch after this list illustrates the operations they affect):
- BUG: closes bug in apply when function returns categorical
- BUG: concat on axis=0 with categorical (GH10177)
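The following is a minimal sketch (not taken from this code base) of the two kinds of operations the patches address; with an unpatched pandas 0.15.2, the categorical dtype would be lost or an error raised:

```python
import pandas as pd

# Two frames that share a categorical column with the same categories.
a = pd.DataFrame({"risk": pd.Categorical(["low", "high"], categories=["low", "high"])})
b = pd.DataFrame({"risk": pd.Categorical(["high", "low"], categories=["low", "high"])})

# Second patch (GH10177): concat along axis=0 should preserve the categorical dtype.
merged = pd.concat([a, b], axis=0, ignore_index=True)
assert str(merged["risk"].dtype) == "category"

# First patch: apply() where the applied function returns a categorical,
# e.g. re-casting every column to the categorical dtype.
recast = a.apply(lambda col: col.astype("category"))
```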
Some non-essential parts of the code depend on additional libraries:
- MongoDB 2.4
- pymongo
- matplotlib
- seaborn 0.5.1
- VIM package in R
The easiest way to set up an R and Python environment is to use Anaconda to install all dependencies. The script below sets up a new environment from scratch under Linux; a short sanity check follows the script.
# Install Miniconda3
wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
chmod +x miniconda.sh
./miniconda.sh -b
export PATH=~/miniconda3/bin:$PATH
conda update --yes conda
conda create --yes --name dream-env python=3.4
conda install --yes --name dream-env -c r --file requirements-conda.txt
source activate dream-env
# Install patched version of pandas
wget https://github.com/pydata/pandas/archive/v0.15.2.tar.gz -O pandas-0.15.2.tar.gz
tar xzvf pandas-0.15.2.tar.gz
cd pandas-0.15.2
wget https://github.com/pydata/pandas/commit/c98dcdf8479b879d2d77d7366109334ba125404b.patch -O bug1.patch
wget https://github.com/pydata/pandas/commit/c97238c2e3b9475b0e30ab7b68ebcf1239ddcc10.patch -O bug2.patch
patch -p1 -f -i bug1.patch
patch -p1 -f -i bug2.patch
python setup.py install
cd ..
# Install additional R packages
R -e 'install.packages(c("mboost", "timeROC", "randomForestSRC", "VIM"), repos="http://cran.r-project.org", dependencies=TRUE)'
# Install rpy2
pip install rpy2
# Install seaborn
# (do not install it from anaconda, since it would pull in a different pandas version)
pip install seaborn==0.5.1
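After the last step, a quick sanity check (a minimal sketch; the expected versions follow the requirements listed above) confirms that the patched pandas and the other core packages are importable:

```python
# Sanity check for the freshly created dream-env environment.
import pandas, rpy2, sklearn, scipy, seaborn

print(pandas.__version__)   # expected: 0.15.2 (with both patches applied)
print(rpy2.__version__)     # expected: 2.6.0
print(sklearn.__version__)  # expected: 0.16.1
print(scipy.__version__)    # expected: 0.15 or later
print(seaborn.__version__)  # expected: 0.5.1
```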
Once you have set up your build environment, compile the C/C++ extensions and install the package by running:
python setup.py install
API documentation can be generated from the source code using Sphinx 1.2.3. Note that version 1.3 or later is known not to work.
cd doc
PYTHONPATH="..:sphinxext" sphinx-autogen api.rst
make html
xdg-open _build/html/index.html
The `scripts` folder contains Python scripts that provide entry points for our analyses. All scripts print a list of available arguments when called with `--help` from the command line.
We provide scripts to perform cross-validation for various models and evaluate them using the challenge's preferred evaluation criterion. Validation is performed in parallel using `IPython.parallel`; therefore, access to an IPython cluster, which can run locally, is necessary. The easiest way to start a cluster is to run `ipcluster start` on the local machine.
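Before launching a validation run, you can verify that the cluster is reachable with a short check like the following (a sketch assuming the default local profile; `IPython.parallel` is the parallel-computing API of IPython 3.x):

```python
# Connect to the running IPython cluster and list the available engines.
from IPython.parallel import Client

rc = Client()  # uses the default local cluster profile
print(len(rc.ids), "engines available")
print(rc[:].apply_sync(lambda: "ok"))  # run a trivial task on every engine
```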
The following scripts are available:
- `validate-survival.py`: Evaluates models for survival analysis (subchallenge 1a).
- `validate-regression.py`: Evaluates models for predicting time of death (subchallenge 1b).
- `validate-classifier.py`: Evaluates models for classification (subchallenge 2).
For instance, to perform cross-validation on data from the ASCENT2 study using a random survival forest, the call would look like the following:
python validate-survival.py -m rsf --event DEATH --time LKADT_P --outcome "1" \
--metric timeroc -i data/q1/train_q1_ASCENT2-imputed.arff \
-p param_grid/q1a/rsf_param_grid.json
Our submission for subchallenge 1a was generated by the script `model_1a_survival.py`.
# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &
# Start IPython cluster
ipcluster start --daemonize
# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_1a_survival.py --event DEATH --time LKADT_P --models-dir ensemble_1a \
-i data/q1/train_q1_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_ASCENT2_CELGENE_EFC6546-imputed.arff
Our submission for subchallenge 1b was generated by the script `model_1b_regression.py`.
# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &
# Start IPython cluster
ipcluster start --daemonize
# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_1b_regression.py --event DEATH --time LKADT_P --models-dir ensemble_1b \
-i data/q1/train_q1_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_ASCENT2_CELGENE_EFC6546-imputed.arff
Our submission for subchallenge 2 was generated by the script `model_2_classification.py`.
# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &
# Start IPython cluster
ipcluster start --daemonize
# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_2_classification.py --event DISCONT --models-dir ensemble_2 \
-i data/q2/train_q2_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_and_leaderboard_ASCENT2_CELGENE_EFC6546-imputed.arff
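All three pipelines above cache intermediate results in MongoDB. A quick way to check that the local instance accepts connections (a sketch assuming the default port 27017 and the `--bind_ip` used above) is:

```python
# Verify that the MongoDB instance used for caching is reachable.
from pymongo import MongoClient

client = MongoClient("127.0.0.1", 27017)
print(client.server_info()["version"])  # raises an error if mongod is not running
```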
Datasets generated from raw CSV files are available in the `data` directory, where each ARFF file contains one partition of the data with its respective set of features (a loading example follows the table):
| Study | Patients | Features (Testing) | Features (Imputation) | Complete Cases |
|---|---|---|---|---|
| ASCENT2 | 476 | 223 | 242 | 78.8% |
| CELGENE | 526 | 383 | 421 | 57.0% |
| EFC6546 | 598 | 350 | 388 | 64.0% |
| ASCENT2 + CELGENE | 1,002 | 221 | 237 | 92.7% |
| ASCENT2 + EFC6546 | 1,074 | 220 | 236 | 92.1% |
| CELGENE + EFC6546 | 1,124 | 345 | 366 | 77.0% |
| All | 1,600 | 217 | 233 | 93.9% |
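To inspect one of these partitions, the ARFF files can be loaded with scipy, for example (a minimal sketch; the file name is taken from the examples above, assuming scipy's ARFF reader can parse the attribute types used in these files):

```python
# Load an ARFF partition into a pandas DataFrame for inspection.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("data/q1/train_q1_ASCENT2-imputed.arff")
df = pd.DataFrame(data)

print(df.shape)          # (patients, features)
print(meta.names()[:5])  # first few attribute names
```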
Scripts contained in the `notebooks` folder can be used to generate datasets from raw data (run `ipython notebook`). First, one has to execute the notebook `DREAM_Prostate_Cancer.ipynb` by following the instructions within the notebook. Imputation is performed by the notebook `DREAM_Prostate_Cancer_Imputation.ipynb`.
The result is 7 ARFF files of training data for subchallenges 1a/b and 2, respectively, and 7 ARFF files of the challenge's test data.