TPOT is an "AutoML" package that finds a good machine learning pipeline for your data. It uses a genetic algorithm to search over sklearn classifiers, preprocessors, and their hyperparameters. It does not train the final model or do prediction; it is just for finding an appropriate model, hyperparameter settings, and preprocessors.
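For context, here is a minimal sketch of the search that train_TPOT.py wraps, written against the tpot library's Python API. The csv path, the 'label' column name, and the train/test split are placeholder assumptions; the actual script adds the command-line options described below.

# Sketch of the TPOT search that train_TPOT.py wraps (placeholder paths/columns).
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

data = pd.read_csv("path/to/training_data.csv")
X = data.drop(columns=["label"])          # assumes a 'label' column holds the class
y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=50, population_size=50, n_jobs=10,
                      template="Selector-Classifier", verbosity=2)
tpot.fit(X_train, y_train)                # genetic search over sklearn pipelines
tpot.export("tpot_output.py")             # writes the best pipeline as a python script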
Here are three scripts that run TPOT, train the model, and do prediction. This is a brief outline of what you'll do to run them, but there's more documentation in the scripts themselves.
See example.sh to run an example full pipeline of scan - train - predict.
1.) train_TPOT.py: Run TPOT on training data to find a good pipeline. This can take several days; runtime can be reduced by limiting the number of classifiers, selectors, transformers, etc. to explore. It can also be sped up by using more jobs, but we have seen crashes when too many jobs are used. Note that TPOT outputs a python script. I suggest telling TPOT to look at only a subset of the classifier models in sklearn with the --classifier_subset flag, as in the example below. I've found that 50 generations with a population size of 50 works pretty well, but results may vary.
python train_TPOT.py --training_data path/to/training_data.csv \
--outfile tpot_output.py \
--classifier_subset sklearn.ensemble.ExtraTreesClassifier sklearn.ensemble.RandomForestClassifier \
--generations 50 --population_size 50 --n_jobs 10 --template Selector-Classifier
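The python script TPOT writes out (tpot_output.py above) will look roughly like the sketch below. The data-loading placeholders are TPOT's own, and the particular selector, classifier, and hyperparameters will depend on your run; the exported_pipeline definition is the part you will reuse in step 2.

# Rough shape of a TPOT-exported script (details will vary with your data and run).
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

tpot_data = pd.read_csv("PATH/TO/DATA/FILE", sep="COLUMN_SEPARATOR")
features = tpot_data.drop("target", axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data["target"], random_state=42)

# This is the piece that matters: the winning Selector-Classifier pipeline.
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=20),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.5,
                         min_samples_leaf=2, min_samples_split=3, n_estimators=100),
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)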
2.) After running TPOT, it's now time to train the model and test it on your hold-out set. TPOT has written a python script (called 'tpot_output.py' in the example above).
Option A) train_test_model.py: Cut and paste the appropriate parts of the TPOT output into the script called train_test_model.py, which will train the model using the pipeline from TPOT and test it against your holdout set. It writes a serialized model object, a csv file with precision-recall results on the test set, and a csv file with the prediction results on the test set (ID1,ID2,known_label,classifier_score,FDR).
Option B) train_test_model2.py: Same thing, but it will automatically extract the appropriate parts of the TPOT output:
python train_test_model2.py --training_infile /path/to/training_data.csv \
--test_infile /path/to/test_data.csv \
--exported_pipeline tpot_test_all.py \
--id_cols 0
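For reference, here is a hedged sketch of what this train-and-test step boils down to. The file names, the 'label' column, and the example pipeline are placeholder assumptions; the real scripts additionally handle ID columns and compute FDRs.

# Sketch of the train-and-test step (placeholder file/column names).
import joblib
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import precision_recall_curve
from sklearn.pipeline import make_pipeline

# Pipeline definition pasted from (or extracted out of) the TPOT output script.
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=20),
    ExtraTreesClassifier(n_estimators=100),
)

train = pd.read_csv("path/to/training_data.csv")
test = pd.read_csv("path/to/test_data.csv")

# Fit on the full training set, then score the hold-out set.
exported_pipeline.fit(train.drop(columns=["label"]), train["label"])
scores = exported_pipeline.predict_proba(test.drop(columns=["label"]))[:, 1]

# Precision-recall on the hold-out set, plus the serialized model for step 3.
precision, recall, thresholds = precision_recall_curve(test["label"], scores)
pd.DataFrame({"precision": precision, "recall": recall}).to_csv("precision_recall.csv", index=False)
joblib.dump(exported_pipeline, "trained_model.pkl")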
3.) You can predict on your full dataset using the trained, serialized model with tpot_predict.py.
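Conceptually, this last step just loads the serialized model and scores every row. A minimal sketch follows; the file names and ID columns are assumptions, so see tpot_predict.py for the actual interface.

# Sketch of the prediction step (placeholder file names).
import joblib
import pandas as pd

model = joblib.load("trained_model.pkl")              # model written in step 2
full_data = pd.read_csv("path/to/full_dataset.csv")

features = full_data.drop(columns=["ID1", "ID2"])     # keep only feature columns
full_data["classifier_score"] = model.predict_proba(features)[:, 1]
full_data.to_csv("full_dataset_scored.csv", index=False)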
-Benjamin Liebeskind and Claire McWhite