
Survival Models for the Prostate Cancer DREAM Challenge

This repository contains an implementation of survival models that have been used by the CAMP team for the Prostate Cancer DREAM Challenge.

‼️ This repository is not actively maintained; please use sebp/scikit-survival instead ‼️

Requirements

All code has only been tested on Linux-based operating systems; therefore, we cannot guarantee that it will run on other platforms. The following instructions apply to Linux-based operating systems only.

Minimum Requirements

  • Python 3.3 or later
  • IPython and IPython notebook 3.1 or later
  • numexpr
  • numpy 1.9 or later
  • pandas 0.15.2 (patched, see below)
  • scikit-learn 0.16.1
  • scipy 0.15 or later
  • six
  • C/C++ compiler
  • rpy2 2.6.0
  • R 3.2 with the following packages installed:
    • randomForestSRC
    • mboost
    • timeROC
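
A quick way to check whether the core toolchain matches these requirements is sketched below (it assumes python3 and R are already on your PATH and the Python packages are importable; adjust to your setup):

# Check Python and core package versions
python3 -c "import sys; print(sys.version)"
python3 -c "import numpy, scipy, pandas, sklearn; print(numpy.__version__, scipy.__version__, pandas.__version__, sklearn.__version__)"

# Check R and that the required R packages can be loaded
R --version | head -n 1
R -e 'library(randomForestSRC); library(mboost); library(timeROC)'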

Patching Pandas 0.15.2

Recent versions of the pandas Python package changed how categorical variables are handled; therefore, this code is known to work only with a patched version of pandas 0.15.2. The following two patches need to be applied to pandas 0.15.2 (the Getting Started section below shows the exact commands):

  • https://github.com/pydata/pandas/commit/c98dcdf8479b879d2d77d7366109334ba125404b.patch
  • https://github.com/pydata/pandas/commit/c97238c2e3b9475b0e30ab7b68ebcf1239ddcc10.patch

Extended Requirements

Some non-essential parts of the code depend on additional libraries:

  • MongoDB 2.4
  • pymongo
  • matplotlib
  • seaborn 0.5.1
  • VIM package in R
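
The MongoDB server has to be installed through your system's package manager; seaborn and the VIM R package are installed by the Getting Started script below. The remaining Python extras could be installed, for example, with:

# Optional Python dependencies (not needed for the core models)
pip install pymongo matplotlib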

Getting Started

The easiest way to set up an R and Python environment is to use Anaconda to install all dependencies. The script below sets up a new environment from scratch under Linux.

# Install Miniconda3
wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
chmod +x miniconda.sh
./miniconda.sh -b
export PATH=~/miniconda3/bin:$PATH
conda update --yes conda
conda create --yes --name dream-env python=3.4
conda install --yes --name dream-env -c r --file requirements-conda.txt
source activate dream-env

# Install patched version of pandas
wget https://github.com/pydata/pandas/archive/v0.15.2.tar.gz -O pandas-0.15.2.tar.gz
tar xzvf pandas-0.15.2.tar.gz
cd pandas-0.15.2
wget https://github.com/pydata/pandas/commit/c98dcdf8479b879d2d77d7366109334ba125404b.patch -O bug1.patch
wget https://github.com/pydata/pandas/commit/c97238c2e3b9475b0e30ab7b68ebcf1239ddcc10.patch -O bug2.patch
patch -p1 -f -i bug1.patch
patch -p1 -f -i bug2.patch
python setup.py install
cd ..

# Install additional R packages
R -e 'install.packages(c("mboost", "timeROC", "randomForestSRC", "VIM"), repos="http://cran.r-project.org", dependencies=TRUE)'

# Install rpy2
pip install rpy2

# Install seaborn
# (do not install it from anaconda, since it would pull in a different pandas version)
pip install seaborn==0.5.1
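
To confirm that the patched pandas from the source tree above (and not a version pulled in by another package) is the one active in the environment, a quick check such as the following can help:

# Should report 0.15.2, i.e. the patched build installed above
python -c "import pandas; print(pandas.__version__)"
python -c "import rpy2; print(rpy2.__version__)"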

Once you have set up your build environment, compile the C/C++ extensions and install the package by running:

python setup.py install
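
If you only want to recompile the extensions in place during development, the standard setuptools command should work as well (assuming the project's setup.py follows the usual conventions):

# Build the C/C++ extensions next to the sources without installing the package
python setup.py build_ext --inplace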

Documentation

API documentation can be generated from the source code using Sphinx 1.2.3. Note that Sphinx 1.3 or later is known not to work.
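Because newer Sphinx releases are known to break the build, it is safest to pin the version explicitly, for example:

# Install the last Sphinx version known to work with these docs
pip install "sphinx==1.2.3"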

cd doc
PYTHONPATH="..:sphinxext" sphinx-autogen api.rst
make html
xdg-open _build/html/index.html

Scripts

The scripts folder contains Python scripts that provide entry points for our analyses. All scripts will print a list of arguments if called with --help from the command line.

Cross-Validation

We provide scripts to perform cross-validation for various models and evaluate them using the challenge's preferred evaluation criterion. Validation is performed in parallel using IPython.parallel. Therefore, access to an IPython cluster, which can run locally, is necessary. The easiest way to start a cluster is to run ipcluster start on the local machine. The following scripts are available:

  • validate-survival.py: Evaluates models for survival analysis (subchallenge 1a).
  • validate-regression.py: Evaluates models for predicting time of death (subchallenge 1b).
  • validate-classifier.py: Evaluates models for classification (subchallenge 2).
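
All of these scripts expect a running IPython cluster; a minimal local setup could look like this (the number of engines is an example value, pick one that matches your machine):

# Start a local IPython cluster with 4 engines in the background
ipcluster start -n 4 --daemonize

# ... run the validation scripts ...

# Shut the cluster down again when you are done
ipcluster stop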

For instance, to do cross-validation for data from the ASCENT2 study using random survival forest, the call would look like the following:

python validate-survival.py -m rsf --event DEATH --time LKADT_P --outcome "1" \
--metric timeroc -i data/q1/train_q1_ASCENT2-imputed.arff \
-p param_grid/q1a/rsf_param_grid.json

Subchallenge 1a

Our submission for subchallenge 1a was generated by the script model_1a_survival.py.

# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &

# Start IPython cluster
ipcluster start --daemonize

# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_1a_survival.py --event DEATH --time LKADT_P --models-dir ensemble_1a \
-i data/q1/train_q1_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_ASCENT2_CELGENE_EFC6546-imputed.arff
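
When the run has finished, the helper processes can be shut down again; a possible cleanup, assuming the same ${DBPATH} as above, is:

# Stop the IPython cluster and the MongoDB cache
ipcluster stop
mongod --shutdown --dbpath ${DBPATH}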

Subchallenge 1b

Our submission for subchallenge 1b was generated by the script model_1b_regression.py.

# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &

# Start IPython cluster
ipcluster start --daemonize

# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_1b_regression.py --event DEATH --time LKADT_P --models-dir ensemble_1b \
-i data/q1/train_q1_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_ASCENT2_CELGENE_EFC6546-imputed.arff

Subchallenge 2

Our submission for subchallenge 2 was generated by the script model_2_classification.py.

# Start MongoDB to cache results
mongod --bind_ip "127.0.0.1" --journal --nohttpinterface --dbpath ${DBPATH} --quiet &

# Start IPython cluster
ipcluster start --daemonize

# Train ensemble of models, write them to disk and perform prediction on test data
python scripts/model_2_classification.py --event DISCONT --models-dir ensemble_2 \
-i data/q2/train_q2_ASCENT2_CELGENE_EFC6546-imputed.arff \
-t data/test/test_and_leaderboard_ASCENT2_CELGENE_EFC6546-imputed.arff

Datasets

Datasets generated from raw CSV files are available from the data directory, where each ARFF file contains one partition of the data with its respective set of features:

| Study             | Patients | Features (Testing) | Features (Imputation) | Complete Cases |
|-------------------|----------|--------------------|-----------------------|----------------|
| ASCENT2           | 476      | 223                | 242                   | 78.8%          |
| CELGENE           | 526      | 383                | 421                   | 57.0%          |
| EFC6546           | 598      | 350                | 388                   | 64.0%          |
| ASCENT2 + CELGENE | 1,002    | 221                | 237                   | 92.7%          |
| ASCENT2 + EFC6546 | 1,074    | 220                | 236                   | 92.1%          |
| CELGENE + EFC6546 | 1,124    | 345                | 366                   | 77.0%          |
| All               | 1,600    | 217                | 233                   | 93.9%          |

Notebooks contained in the notebooks folder can be used to generate datasets from the raw data (run ipython notebook). First, execute the notebook DREAM_Prostate_Cancer.ipynb, following the instructions within the notebook. Imputation is performed by the notebook DREAM_Prostate_Cancer_Imputation.ipynb. The result is 7 ARFF files of training data for subchallenges 1a/b and 2, respectively, and 7 ARFF files of the challenge's test data.
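
Assuming the notebooks are run with the IPython 3.x notebook server, the workflow roughly looks like this:

# Start the notebook server
cd notebooks
ipython notebook
# In the browser, run DREAM_Prostate_Cancer.ipynb first,
# then DREAM_Prostate_Cancer_Imputation.ipynb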
