diff --git a/.gitignore b/.gitignore
index 332f1f28..d437602b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -75,3 +75,5 @@ docs/sources/examples/.Rhistory
 analyze-oj2-tpot-mdr.ipynb
 tpot-mdr-demo.ipynb
+
+github.com/
diff --git a/.travis.yml b/.travis.yml
index 67beda6c..d01b217f 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -5,9 +5,9 @@ env:
   matrix:
     # let's start simple:
     - PYTHON_VERSION="2.7" LATEST="true"
-    - PYTHON_VERSION="3.4" LATEST="true"
-    - PYTHON_VERSION="3.5" COVERAGE="true" LATEST="true" "DEAP_VERSION=1.0.1" "XGBOOST_VERSION=0.4a30"
     - PYTHON_VERSION="3.5" LATEST="true"
+    - PYTHON_VERSION="3.6" COVERAGE="true" LATEST="true" "DEAP_VERSION=1.0.2" "XGBOOST_VERSION=0.6"
+    - PYTHON_VERSION="3.6" LATEST="true"
 install: source ./ci/.travis_install.sh
 script: bash ./ci/.travis_test.sh
 after_success:
diff --git a/MANIFEST.in b/MANIFEST.in
index e6e1e6a3..3ee3ca24 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,3 +1,3 @@
-include README.md LICENSE tests.py
+include README.md LICENSE tests.py tests.csv
 recursive-include images *
 recursive-include tpot *.py
diff --git a/README.md b/README.md
index 49a09a5e..acdc7577 100644
--- a/README.md
+++ b/README.md
@@ -143,7 +143,7 @@ exported_pipeline.fit(training_features, training_classes)
 results = exported_pipeline.predict(testing_features)
 ```

-Check the documentation for [more examples and tutorials](http://rhiever.github.io/tpot/examples/MNIST_Example/).
+Check the documentation for [more examples and tutorials](http://rhiever.github.io/tpot/examples/).

 ## Contributing to TPOT
diff --git a/ci/.travis_install.sh b/ci/.travis_install.sh
index 267d8e60..c2830dd4 100755
--- a/ci/.travis_install.sh
+++ b/ci/.travis_install.sh
@@ -53,6 +53,7 @@ fi
 pip install update_checker
 pip install tqdm
+pip install pathos

 if [[ "$COVERAGE" == "true" ]]; then
     pip install coverage coveralls
@@ -67,4 +68,5 @@ python -c "import deap; print('deap %s' % deap.__version__)"
 python -c "import xgboost; print('xgboost %s ' % xgboost.__version__)"
 python -c "import update_checker; print('update_checker %s' % update_checker.__version__)"
 python -c "import tqdm; print('tqdm %s' % tqdm.__version__)"
+python -c "import pathos; print('pathos %s' % pathos.__version__)"
 python setup.py build_ext --inplace
diff --git a/ci/.travis_test.sh b/ci/.travis_test.sh
index 77b3733b..162d8092 100755
--- a/ci/.travis_test.sh
+++ b/ci/.travis_test.sh
@@ -17,6 +17,7 @@ python -c "import deap; print('deap %s' % deap.__version__)"
 python -c "import xgboost; print('xgboost %s ' % xgboost.__version__)"
 python -c "import update_checker; print('update_checker %s ' % update_checker.__version__)"
 python -c "import tqdm; print('tqdm %s' % tqdm.__version__)"
+python -c "import pathos; print('pathos %s' % pathos.__version__)"

 if [[ "$COVERAGE" == "true" ]]; then
     nosetests -s -v --with-coverage
diff --git a/docs/citing/index.html b/docs/citing/index.html
index 2bf6509e..cd420bc3 100644
--- a/docs/citing/index.html
+++ b/docs/citing/index.html
@@ -45,95 +45,50 @@
@@ -230,7 +185,7 @@
-Copyright © 2016-Present Randal S. Olson
+Copyright © 2015-Present Randal S. Olson
@@ -247,7 +202,7 @@
diff --git a/docs/contributing/index.html b/docs/contributing/index.html
index d38e5c1c..1ef02632 100644
--- a/docs/contributing/index.html
+++ b/docs/contributing/index.html
@@ -45,110 +45,65 @@
@@ -191,7 +146,7 @@

Project layout

In terms of directory structure:

Updating the documentation

-We use mkdocs to manage our documentation. This allows us to write the docs in Markdown and compile them to HTML as needed. Below are a few useful commands to know when updating the documentation. Make sure that you are running them in the base documentation directory, docs.
+We use mkdocs to manage our project documentation. This allows us to write the documentation in Markdown and compile it to HTML as needed. Below are a couple of useful commands to know when updating the documentation. Make sure that you are running these commands in the base directory of the TPOT project; a typical session is sketched after the list below.

  • mkdocs serve: Hosts a local version of the documentation that you can access at the provided URL. The local version will update automatically as you save changes to the documentation.

  • -

    mkdocs build --clean: Creates a fresh build of the documentation in HTML. Always run this before deploying the documentation to GitHub.

    -
  • -
  • -

    mkdocs gh-deploy: Deploys the documentation to GitHub. If you're deploying on your fork of TPOT, the online documentation should be accessible at http://<YOUR GITHUB USERNAME>.github.io/tpot/. Generally, you shouldn't need to run this command because you can view your changes with mkdocs serve.

    +

    mkdocs build --clean: Creates a fresh build of the documentation in HTML in the docs directory. Always run this before pushing the documentation to GitHub.

After submitting your pull request

@@ -317,7 +269,7 @@
@@ -327,7 +279,7 @@
-Copyright © 2016-Present Randal S. Olson
+Copyright © 2015-Present Randal S. Olson
@@ -344,10 +296,10 @@

After submitting your pull request - GitHub + GitHub - « Previous + « Previous Next » diff --git a/docs/css/theme_extra.css b/docs/css/theme_extra.css index b8b06d7f..e53d320a 100644 --- a/docs/css/theme_extra.css +++ b/docs/css/theme_extra.css @@ -22,10 +22,25 @@ * area doesn't scroll. * * https://github.com/mkdocs/mkdocs/pull/202 + * + * Builds upon pull 202 https://github.com/mkdocs/mkdocs/pull/202 + * to make toc scrollbar end before navigations buttons to not be overlapping. */ .wy-nav-side { - height: 100%; + height: calc(100% - 45px); overflow-y: auto; + min-height: 0; +} + +.rst-versions{ + border-top: 0; + height: 45px; +} + +@media screen and (max-width: 768px) { + .wy-nav-side { + height: 100%; + } } /* diff --git a/docs/examples/Boston_Example/index.html b/docs/examples/Boston_Example/index.html deleted file mode 100644 index ff10cbd6..00000000 --- a/docs/examples/Boston_Example/index.html +++ /dev/null @@ -1,266 +0,0 @@ - - - - - - - - - - - Boston Example - TPOT - - - - - - - - - - - - - - - - -
The following code illustrates the usage of TPOT with the Boston house prices data set.

-
from tpot import TPOTRegressor
-from sklearn.datasets import load_boston
-from sklearn.model_selection import train_test_split
-
-digits = load_boston()
-X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
-                                                    train_size=0.75, test_size=0.25)
-
-tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
-tpot.fit(X_train, y_train)
-print(tpot.score(X_test, y_test))
-tpot.export('tpot_boston_pipeline.py')
-
- -

Running this code should discover a pipeline that achieves about 12.77 mean squared error (MSE).

-

For details on how the fit(), score() and export() functions work, see the usage documentation.

-

After running the above code, the corresponding Python code should be exported to the tpot_boston_pipeline.py file and look similar to the following:

-
import numpy as np
-
-from sklearn.model_selection import train_test_split
-from sklearn.ensemble import ExtraTreesRegressor
-from sklearn.pipeline import make_pipeline
-
-# NOTE: Make sure that the target is labeled 'class' in the data file
-tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
-features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
-                     tpot_data.dtype.names.index('class'), axis=1)
-
-training_features, testing_features, training_classes, testing_classes = \
-    train_test_split(features, tpot_data['class'], random_state=42)
-
-exported_pipeline = make_pipeline(
-    ExtraTreesRegressor(max_features=0.76, n_estimators=500)
-)
-
-exported_pipeline.fit(training_features, training_classes)
-results = exported_pipeline.predict(testing_features)
-
- -
diff --git a/docs/examples/IRIS_Example/index.html b/docs/examples/IRIS_Example/index.html
deleted file mode 100644
index 2013e590..00000000
--- a/docs/examples/IRIS_Example/index.html
+++ /dev/null
@@ -1,268 +0,0 @@
-IRIS Example - TPOT

The following code illustrates the usage of TPOT with the IRIS data set.

-
from tpot import TPOTClassifier
-from sklearn.datasets import load_iris
-from sklearn.model_selection import train_test_split
-import numpy as np
-
-iris = load_iris()
-X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
-    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)
-
-tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
-tpot.fit(X_train, y_train)
-print(tpot.score(X_test, y_test))
-tpot.export('tpot_iris_pipeline.py')
-
- -

Running this code should discover a pipeline that achieves ~96% testing accuracy.

-

For details on how the fit(), score() and export() functions work, see the usage documentation.

-

After running the above code, the corresponding Python code should be exported to the tpot_iris_pipeline.py file and look similar to the following:

-
import numpy as np
-
-from sklearn.model_selection import train_test_split
-from sklearn.ensemble import VotingClassifier
-from sklearn.linear_model import LogisticRegression
-from sklearn.pipeline import make_pipeline, make_union
-from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures
-
-# NOTE: Make sure that the class is labeled 'class' in the data file
-tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
-features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
-training_features, testing_features, training_classes, testing_classes = \
-    train_test_split(features, tpot_data['class'], random_state=42)
-
-exported_pipeline = make_pipeline(
-    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
-    LogisticRegression(C=0.9, dual=False, penalty="l2")
-)
-
-exported_pipeline.fit(training_features, training_classes)
-results = exported_pipeline.predict(testing_features)
-
- -
diff --git a/docs/examples/MNIST_Example/index.html b/docs/examples/MNIST_Example/index.html
deleted file mode 100644
index ed55ee69..00000000
--- a/docs/examples/MNIST_Example/index.html
+++ /dev/null
@@ -1,263 +0,0 @@
-MNIST Example - TPOT

Below is a minimal working example with the practice MNIST data set.

-
from tpot import TPOTClassifier
-from sklearn.datasets import load_digits
-from sklearn.model_selection import train_test_split
-
-digits = load_digits()
-X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
-                                                    train_size=0.75, test_size=0.25)
-
-tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
-tpot.fit(X_train, y_train)
-print(tpot.score(X_test, y_test))
-tpot.export('tpot_mnist_pipeline.py')
-
- -

For details on how the fit(), score() and export() functions work, see the usage documentation.

-

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_mnist_pipeline.py file and look similar to the following:

-
import numpy as np
-
-from sklearn.model_selection import train_test_split
-from sklearn.neighbors import KNeighborsClassifier
-from sklearn.pipeline import make_pipeline
-
-# NOTE: Make sure that the class is labeled 'class' in the data file
-tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')
-features = tpot_data.view((np.float64, len(tpot_data.dtype.names)))
-features = np.delete(features, tpot_data.dtype.names.index('class'), axis=1)
-training_features, testing_features, training_classes, testing_classes =     train_test_split(features, tpot_data['class'], random_state=42)
-
-exported_pipeline = make_pipeline(
-    KNeighborsClassifier(n_neighbors=3, weights="uniform")
-)
-
-exported_pipeline.fit(training_features, training_classes)
-results = exported_pipeline.predict(testing_features)
-
diff --git a/docs/examples/Titanic_Kaggle_Example/index.html b/docs/examples/Titanic_Kaggle_Example/index.html
deleted file mode 100644
index 5a2c2c88..00000000
--- a/docs/examples/Titanic_Kaggle_Example/index.html
+++ /dev/null
@@ -1,228 +0,0 @@
-Titanic Kaggle Example - TPOT

To see the TPOT applied the Titanic Kaggle dataset, see the Jupyter notebook here.

diff --git a/docs/examples/index.html b/docs/examples/index.html
new file mode 100644
index 00000000..17ab8308
--- /dev/null
+++ b/docs/examples/index.html
@@ -0,0 +1,310 @@
+Examples - TPOT

Iris flower classification

+

The following code illustrates the usage of TPOT with the Iris data set, which is a simple supervised classification problem.

+
from tpot import TPOTClassifier
+from sklearn.datasets import load_iris
+from sklearn.model_selection import train_test_split
+import numpy as np
+
+iris = load_iris()
+X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
+    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)
+
+tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
+tpot.fit(X_train, y_train)
+print(tpot.score(X_test, y_test))
+tpot.export('tpot_iris_pipeline.py')
+
+ +

Running this code should discover a pipeline that achieves about 97% testing accuracy.

+

For details on how the fit(), score() and export() functions work, see the usage documentation.

+

After running the above code, the corresponding Python code should be exported to the tpot_iris_pipeline.py file and look similar to the following:

+
import numpy as np
+
+from sklearn.model_selection import train_test_split
+from sklearn.naive_bayes import GaussianNB
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import Normalizer
+
+# NOTE: Make sure that the class is labeled 'class' in the data file
+tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
+features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
+                     tpot_data.dtype.names.index('class'), axis=1)
+training_features, testing_features, training_classes, testing_classes = \
+    train_test_split(features, tpot_data['class'], random_state=42)
+
+exported_pipeline = make_pipeline(
+    Normalizer(),
+    GaussianNB()
+)
+
+exported_pipeline.fit(training_features, training_classes)
+results = exported_pipeline.predict(testing_features)
+
+ +

MNIST digit recognition

+

Below is a minimal working example with the practice MNIST data set, which is an image classification problem.

+
from tpot import TPOTClassifier
+from sklearn.datasets import load_digits
+from sklearn.model_selection import train_test_split
+
+digits = load_digits()
+X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
+                                                    train_size=0.75, test_size=0.25)
+
+tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
+tpot.fit(X_train, y_train)
+print(tpot.score(X_test, y_test))
+tpot.export('tpot_mnist_pipeline.py')
+
+ +

For details on how the fit(), score() and export() functions work, see the usage documentation.

+

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_mnist_pipeline.py file and look similar to the following:

+
import numpy as np
+
+from sklearn.model_selection import train_test_split
+from sklearn.neighbors import KNeighborsClassifier
+
+# NOTE: Make sure that the class is labeled 'class' in the data file
+tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
+features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
+                     tpot_data.dtype.names.index('class'), axis=1)
+training_features, testing_features, training_classes, testing_classes = \
+    train_test_split(features, tpot_data['class'], random_state=42)
+
+exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")
+
+exported_pipeline.fit(training_features, training_classes)
+results = exported_pipeline.predict(testing_features)
+
+ +
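As a sketch of how this exported template is meant to be used: assuming a hypothetical comma-separated file digits.csv whose target column is named 'class' (the file name and separator below are illustrative stand-ins for the placeholders, not TPOT output), the np.recfromcsv line would become:

tpot_data = np.recfromcsv('digits.csv', delimiter=',', dtype=np.float64)  # hypothetical path and separator

The rest of the exported script can then run unchanged, since it only assumes a numeric table with a target column named 'class'.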

Boston housing prices modeling

+

The following code illustrates the usage of TPOT with the Boston housing prices data set, which is a regression problem.

+
from tpot import TPOTRegressor
+from sklearn.datasets import load_boston
+from sklearn.model_selection import train_test_split
+
+housing = load_boston()
+X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
+                                                     train_size=0.75, test_size=0.25)
+
+tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)
+tpot.fit(X_train, y_train)
+print(tpot.score(X_test, y_test))
+tpot.export('tpot_boston_pipeline.py')
+
+ +

Running this code should discover a pipeline that achieves a mean squared error (MSE) of roughly 10 on the test set.

+

For details on how the fit(), score() and export() functions work, see the usage documentation.

+

After running the above code, the corresponding Python code should be exported to the tpot_boston_pipeline.py file and look similar to the following:

+
import numpy as np
+
+from sklearn.ensemble import GradientBoostingRegressor
+from sklearn.model_selection import train_test_split
+
+# NOTE: Make sure that the class is labeled 'class' in the data file
+tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
+features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
+                     tpot_data.dtype.names.index('class'), axis=1)
+training_features, testing_features, training_classes, testing_classes = \
+    train_test_split(features, tpot_data['class'], random_state=42)
+
+exported_pipeline = GradientBoostingRegressor(alpha=0.85, learning_rate=0.1, loss="ls",
+                                              max_features=0.9, min_samples_leaf=5,
+                                              min_samples_split=6)
+
+exported_pipeline.fit(training_features, training_classes)
+results = exported_pipeline.predict(testing_features)
+
+ +

Titanic survival analysis

+

To see TPOT applied to the Titanic Kaggle dataset, see the Jupyter notebook here. This example shows how to take a messy dataset and preprocess it so that it can be used with scikit-learn and TPOT.

diff --git a/docs/index.html b/docs/index.html
index 71250fd7..cb1f68c0 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -45,95 +45,50 @@
@@ -213,7 +168,7 @@
-Copyright © 2016-Present Randal S. Olson
+Copyright © 2015-Present Randal S. Olson
@@ -230,7 +185,7 @@
@@ -244,6 +199,6 @@
diff --git a/docs/installing/index.html b/docs/installing/index.html
index 49cad5b9..bfb34f12 100644
--- a/docs/installing/index.html
+++ b/docs/installing/index.html
@@ -45,95 +45,50 @@
@@ -200,6 +155,10 @@
pip install deap update_checker tqdm
 
For the Windows OS, the pywin32 module is required if Python is NOT installed via the Anaconda Python distribution; it can be installed with pip via the command:

pip install pywin32
Optionally, install XGBoost if you would like TPOT to use XGBoost. XGBoost is entirely optional, and TPOT will still function normally without XGBoost if you do not have it installed.

pip install xgboost
 
@@ -230,7 +189,7 @@
-Copyright © 2016-Present Randal S. Olson
+Copyright © 2015-Present Randal S. Olson
@@ -247,7 +206,7 @@
diff --git a/docs/js/theme.js b/docs/js/theme.js
index 6396162c..f1f0a588 100644
--- a/docs/js/theme.js
+++ b/docs/js/theme.js
@@ -1,5 +1,4 @@
 $( document ).ready(function() {
-
     // Shift nav in mobile when clicking the menu.
     $(document).on('click', "[data-toggle='wy-nav-top']", function() {
         $("[data-toggle='wy-nav-shift']").toggleClass("shift");
@@ -53,3 +52,31 @@ window.SphinxRtdTheme = (function (jquery) {
         StickyNav : stickyNav
     };
 }($));
+
+// The code below is a copy of @seanmadsen's code posted Jan 10, 2017 on issue 803:
+// https://github.com/mkdocs/mkdocs/issues/803
+// It incorporates the auto-scroll into the theme itself without the need
+// for an additional custom.js file.
+//
+$(function() {
+  $.fn.isFullyWithinViewport = function(){
+    var viewport = {};
+    viewport.top = $(window).scrollTop();
+    viewport.bottom = viewport.top + $(window).height();
+    var bounds = {};
+    bounds.top = this.offset().top;
+    bounds.bottom = bounds.top + this.outerHeight();
+    return ( ! (
+      (bounds.top <= viewport.top) ||
+      (bounds.bottom >= viewport.bottom)
+    ) );
+  };
+  if( !$('li.toctree-l1.current').isFullyWithinViewport() ) {
+    $('.wy-nav-side')
+      .scrollTop(
+        $('li.toctree-l1.current').offset().top -
+        $('.wy-nav-side').offset().top -
+        60
+      );
+  }
+});
diff --git a/docs/mkdocs/search_index.json b/docs/mkdocs/search_index.json
index 40e50879..0f9d93cc 100644
--- a/docs/mkdocs/search_index.json
+++ b/docs/mkdocs/search_index.json
@@ -7,62 +7,72 @@
    },
    {
        "location": "/installing/",
-        "text": "TPOT is built on top of several existing Python libraries, including:\n\n\n\n\n\nNumPy\n\n\n\n\n\n\nSciPy\n\n\n\n\n\n\nscikit-learn\n\n\n\n\n\n\nDEAP\n\n\n\n\n\n\nupdate_checker\n\n\n\n\n\n\ntqdm\n\n\n\n\n\n\nMost of the necessary Python packages can be installed via the \nAnaconda Python distribution\n, which we strongly recommend that you use. We also strongly recommend that you use of Python 3 over Python 2 if you're given the choice.\n\n\nNumPy, SciPy, and scikit-learn can be installed in Anaconda via the command:\n\n\nconda install numpy scipy scikit-learn\n\n\n\n\nDEAP, update_checker, and tqdm (used for verbose TPOT runs) can be installed with \npip\n via the command:\n\n\npip install deap update_checker tqdm\n\n\n\n\nOptionally, install XGBoost if you would like TPOT to use XGBoost. XGBoost is entirely optional, and TPOT will still function normally without XGBoost if you do not have it installed.\n\n\npip install xgboost\n\n\n\n\nIf you have issues installing XGBoost, check the \nXGBoost installation documentation\n.\n\n\nFinally to install TPOT itself, run the following command:\n\n\npip install tpot\n\n\n\n\nPlease \nfile a new issue\n if you run into installation problems.",
+        "text": "TPOT is built on top of several existing Python libraries, including:\n\n\n\n\n\nNumPy\n\n\n\n\n\n\nSciPy\n\n\n\n\n\n\nscikit-learn\n\n\n\n\n\n\nDEAP\n\n\n\n\n\n\nupdate_checker\n\n\n\n\n\n\ntqdm\n\n\n\n\n\n\nMost of the necessary Python packages can be installed via the \nAnaconda Python distribution\n, which we strongly recommend that you use. 
We also strongly recommend that you use of Python 3 over Python 2 if you're given the choice.\n\n\nNumPy, SciPy, and scikit-learn can be installed in Anaconda via the command:\n\n\nconda install numpy scipy scikit-learn\n\n\n\n\nDEAP, update_checker, and tqdm (used for verbose TPOT runs) can be installed with \npip\n via the command:\n\n\npip install deap update_checker tqdm\n\n\n\n\nFor the Windows OS\n, the pywin32 module is required if the Python is NOT installed via \nAnaconda Python distribution\n and can be installed with \npip\n via the command:\n\n\npip install pywin32\n\n\n\n\nOptionally, install XGBoost if you would like TPOT to use XGBoost. XGBoost is entirely optional, and TPOT will still function normally without XGBoost if you do not have it installed.\n\n\npip install xgboost\n\n\n\n\nIf you have issues installing XGBoost, check the \nXGBoost installation documentation\n.\n\n\nFinally to install TPOT itself, run the following command:\n\n\npip install tpot\n\n\n\n\nPlease \nfile a new issue\n if you run into installation problems.", "title": "Installation" }, { "location": "/using/", - "text": "TPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nTPOT offers several arguments that can be provided at the command line:\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer\n\n\nNumber of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate. We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. 
We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-cv\n\n\nNUM_CV_FOLDS\n\n\nAny integer >2\n\n\nThe number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline. This setting will override the GENERATIONS parameter and allow TPOT to run until it runs out of time.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility. Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running: 0 = none, 1 = minimal, 2 = all. A setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nN/A\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nN/A\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nN/A\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. Other than the class name, a \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n.\n\n\nNote that you can pass several parameters to the TPOT instantiation call:\n\n\n\n\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\ngeneration\n\n\nAny positive integer\n\n\nThe number of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. 
TPOT will evaluate generations x population_size number of pipelines in total.\n\n\n\n\n\n\npopulation_size\n\n\nAny positive integer\n\n\nThe number of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate generations x population_size number of pipelines in total.\n\n\n\n\n\n\nmutation_rate\n\n\n[0.0, 1.0]\n\n\nThe mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing.\n\n\n\n\n\n\ncrossover_rate\n\n\n[0.0, 1.0]\n\n\nThe crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to \"breed\" every generation. We don't recommend that you tweak this parameter unless you know what you're doing.\n\n\n\n\n\n\nnum_cv_folds\n\n\n[2, 10]\n\n\nThe number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process.\n\n\n\n\n\n\nscoring\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature \nscorer(y_true, y_pred)\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on \nscoring functions\n for more details.\n\n\n\n\n\n\nmax_time_mins\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline. This setting will override the generations parameter.\n\n\n\n\n\n\nmax_eval_time_mins\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\nrandom_state\n\n\nAny positive integer\n\n\nThe random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\n\n\nverbosity\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar to calls to fit().\n\n\n\n\n\n\ndisable_update_check\n\n\n[True, False]\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n\nSome example code with custom TPOT parameters might look like:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. 
You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\n\n\n\n\nThe \nfit()\n function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algoritm to find the best pipeline based on average k-fold score.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore()\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport()\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string from the list described in the table above. Any other strings will cause internal issues that may break your code down the line.\n\n\n\n\n\n\nYou can pass in a function with the signature \nscorer(y_true, y_pred)\n, where \ny_true\n are the true target values and \ny_pred\n are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.\n\n\n\n\n\n\ndef accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)", + "text": "TPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nTPOT offers several arguments that can be provided at the command line:\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a supervised classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer\n\n\nNumber of iterations to run the pipeline optimization process. 
Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-os\n\n\nOFFSPRING_SIZE\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, OFFSPRING_SIZE = POPULATION_SIZE.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. 
By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-cv\n\n\nNUM_CV_FOLDS\n\n\nAny integer >1\n\n\nNumber of folds to evaluate each pipeline over in 'k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n\n\n-njobs\n\n\nNUM_JOBS\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility.\n\n\nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-config\n\n\nCONFIG_FILE\n\n\nString path to a file\n\n\nConfiguration file for customizing the operators and parameters that TPOT uses in the optimization process.\n\n\nSee the \ncustom configuration\n section for more information and examples.\n\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running.\n\n\n0 = none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. Other than the class name, a \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n.\n\n\nNote that you can pass several parameters to the TPOT instantiation call:\n\n\n\n\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\ngenerations\n\n\nAny positive integer\n\n\nNumber of iterations to the run pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\npopulation_size\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. 
Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\noffspring_size\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, offspring_size = population_size.\n\n\n\n\n\n\nmutation_rate\n\n\n[0.0, 1.0]\n\n\nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\ncrossover_rate\n\n\n[0.0, 1.0]\n\n\nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\nscoring\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature \nscorer(y_true, y_pred)\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\ncv\n\n\nAny integer >1\n\n\nNumber of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n\n\nn_jobs\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer.\n\n\n\n\n\n\nmax_time_mins\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time.\n\n\n\n\n\n\nmax_eval_time_mins\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to explore more complex pipelines, but will also allow TPOT to run longer.\n\n\n\n\n\n\nrandom_state\n\n\nAny positive integer\n\n\nRandom number generator seed for TPOT.\n\n\nUse this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\n\n\nconfig_dict\n\n\nPython dictionary\n\n\nConfiguration dictionary for customizing the operators and parameters that TPOT uses in the optimization process.\n\n\nSee the \ncustom configuration\n section for more information and examples.\n\n\n\n\n\n\n\n\n\nwarm_start\n\n\n[True, False]\n\n\nFlag indicating whether the TPOT instance will reuse the population from previous calls to fit().\n\n\n\n\n\n\nverbosity\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it's running.\n\n\n0 = 
none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\ndisable_update_check\n\n\n[True, False]\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n\nSome example code with custom TPOT parameters might look like:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\n\n\n\n\nThe \nfit()\n function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algoritm to find the best pipeline based on average k-fold score.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore()\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport()\n function:\n\n\nfrom tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string to the \nscoring\n parameter from the list above. Any other strings will cause TPOT to throw an exception.\n\n\n\n\n\n\nYou can pass a function with the signature \nscorer(y_true, y_pred)\n, where \ny_true\n are the true target values and \ny_pred\n are the predicted target values from an estimator. To do this, you should implement your own function. 
See the example below for further explanation.\n\n\n\n\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\n\n\nCustomizing TPOT's operators and parameters\n\n\nTPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.\n\n\nThe custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., \nsklearn.naive_bayes.MultinomialNB\n) and the second level key is the corresponding parameter name for that operator (e.g., \nfit_prior\n). The second level key should point to a list of parameter values for that parameter, e.g., \n'fit_prior': [True, False]\n.\n\n\nFor a simple example, the configuration could be:\n\n\nclassifier_config_dict = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\n\n\n\nin which case TPOT would only explore pipelines containing \nGaussianNB\n, \nBernoulliNB\n, \nMultinomialNB\n, and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the \nTPOTClassifier\n/\nTPOTRegressor\n \nconfig_dict\n parameter, described above. For example:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\nclassifier_config_dict = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=classifier_config_dict)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nCommand-line users must create a separate \n.py\n file with the custom configuration and provide the path to the file to the \ntpot\n call. 
For example, if the simple example configuration above is saved in \ntpot_classifier_config.py\n, that configuration could be used on the command line with the command:\n\n\ntpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py\n\n\n\n\nFor more detailed examples of how to customize TPOT's operator configuration, see the default configurations for \nclassification\n and \nregression\n in TPOT's source code.\n\n\nNote that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it explores.", "title": "Using TPOT" }, { "location": "/using/#tpot-on-the-command-line", - "text": "To use TPOT via the command line, enter the following command with a path to the data file: tpot /path_to/data_file.csv TPOT offers several arguments that can be provided at the command line: Argument Parameter Valid values Effect -is INPUT_SEPARATOR Any string Character used to separate columns in the input file. -target TARGET_NAME Any string Name of the target column in the input file. -mode TPOT_MODE ['classification', 'regression'] Whether TPOT is being used for a classification or regression problem. -o OUTPUT_FILE String path to a file File to export the code for the final optimized pipeline. -g GENERATIONS Any positive integer Number of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total. -p POPULATION_SIZE Any positive integer Number of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total. -mr MUTATION_RATE [0.0, 1.0] GP mutation rate. We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. -xr CROSSOVER_RATE [0.0, 1.0] GP crossover rate in the range [0.0, 1.0]. We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. -cv NUM_CV_FOLDS Any integer >2 The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process. -scoring SCORING_FN 'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' Function used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on scoring functions for more details. -maxtime MAX_TIME_MINS Any positive integer How many minutes TPOT has to optimize the pipeline. This setting will override the GENERATIONS parameter and allow TPOT to run until it runs out of time. 
-maxeval MAX_EVAL_MINS Any positive integer How many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. -s RANDOM_STATE Any positive integer Random number generator seed for reproducibility. Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. -v VERBOSITY {0, 1, 2, 3} How much information TPOT communicates while it is running: 0 = none, 1 = minimal, 2 = all. A setting of 2 or higher will add a progress bar during the optimization procedure. --no-update-check N/A Flag indicating whether the TPOT version checker should be disabled. --version N/A Show TPOT's version number and exit. --help N/A Show TPOT's help documentation and exit. An example command-line call to TPOT may look like: tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2", + "text": "To use TPOT via the command line, enter the following command with a path to the data file: tpot /path_to/data_file.csv TPOT offers several arguments that can be provided at the command line: Argument Parameter Valid values Effect -is INPUT_SEPARATOR Any string Character used to separate columns in the input file. -target TARGET_NAME Any string Name of the target column in the input file. -mode TPOT_MODE ['classification', 'regression'] Whether TPOT is being used for a supervised classification or regression problem. -o OUTPUT_FILE String path to a file File to export the code for the final optimized pipeline. -g GENERATIONS Any positive integer Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -p POPULATION_SIZE Any positive integer Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -os OFFSPRING_SIZE Any positive integer Number of offspring to produce in each GP generation. \nBy default, OFFSPRING_SIZE = POPULATION_SIZE. -mr MUTATION_RATE [0.0, 1.0] GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. -xr CROSSOVER_RATE [0.0, 1.0] GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. -scoring SCORING_FN 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. 
\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. \nSee the section on scoring functions for more details. -cv NUM_CV_FOLDS Any integer >1 Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. -njobs NUM_JOBS Any positive integer or -1 Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. \nAssigning this to -1 will use as many cores as available on the computer. -maxtime MAX_TIME_MINS Any positive integer How many minutes TPOT has to optimize the pipeline. \nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time. -maxeval MAX_EVAL_MINS Any positive integer How many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. -s RANDOM_STATE Any positive integer Random number generator seed for reproducibility. \nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. -config CONFIG_FILE String path to a file Configuration file for customizing the operators and parameters that TPOT uses in the optimization process. \nSee the custom configuration section for more information and examples. -v VERBOSITY {0, 1, 2, 3} How much information TPOT communicates while it is running. \n0 = none, 1 = minimal, 2 = high, 3 = all. \nA setting of 2 or higher will add a progress bar during the optimization procedure. --no-update-check Flag indicating whether the TPOT version checker should be disabled. --version Show TPOT's version number and exit. --help Show TPOT's help documentation and exit. An example command-line call to TPOT may look like: tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2",
        "title": "TPOT on the command line"
    },
    {
        "location": "/using/#tpot-with-code",
-        "text": "We've taken care to design the TPOT interface to be as similar as possible to scikit-learn. TPOT can be imported just like any regular Python module. To import TPOT, type: from tpot import TPOTClassifier then create an instance of TPOT as follows: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier() It's also possible to use TPOT for regression problems with the TPOTRegressor class. Other than the class name, a TPOTRegressor is used the same way as a TPOTClassifier . Note that you can pass several parameters to the TPOT instantiation call: Parameter Valid values Effect generation Any positive integer The number of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. TPOT will evaluate generations x population_size number of pipelines in total. population_size Any positive integer The number of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate generations x population_size number of pipelines in total. mutation_rate [0.0, 1.0] The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing. 
crossover_rate [0.0, 1.0] The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to \"breed\" every generation. We don't recommend that you tweak this parameter unless you know what you're doing. num_cv_folds [2, 10] The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process. scoring 'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature scorer(y_true, y_pred) Function used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on scoring functions for more details. max_time_mins Any positive integer How many minutes TPOT has to optimize the pipeline. This setting will override the generations parameter. max_eval_time_mins Any positive integer How many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. random_state Any positive integer The random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. verbosity {0, 1, 2, 3} How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar to calls to fit(). disable_update_check [True, False] Flag indicating whether the TPOT version checker should be disabled. Some example code with custom TPOT parameters might look like: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2) Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the fit function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes) The fit() function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algoritm to find the best pipeline based on average k-fold score. 
You can then proceed to evaluate the final pipeline on the testing set with the score() function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes)) Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\npipeline_optimizer.export('tpot_exported_pipeline.py') Once this code finishes running, tpot_exported_pipeline.py will contain the Python code for the optimized pipeline. Check our examples to see TPOT applied to some specific data sets.",
+        "text": "We've taken care to design the TPOT interface to be as similar as possible to scikit-learn. TPOT can be imported just like any regular Python module. To import TPOT, type: from tpot import TPOTClassifier then create an instance of TPOT as follows: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier() It's also possible to use TPOT for regression problems with the TPOTRegressor class. Other than the class name, a TPOTRegressor is used the same way as a TPOTClassifier . Note that you can pass several parameters to the TPOT instantiation call: Parameter Valid values Effect generations Any positive integer Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. population_size Any positive integer Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. offspring_size Any positive integer Number of offspring to produce in each GP generation. \nBy default, offspring_size = population_size. mutation_rate [0.0, 1.0] Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate [0.0, 1.0] Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. 
scoring 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature scorer(y_true, y_pred) Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. \nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. \nSee the section on scoring functions for more details. cv Any integer >1 Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. n_jobs Any positive integer or -1 Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. \nAssigning this to -1 will use as many cores as available on the computer. max_time_mins Any positive integer How many minutes TPOT has to optimize the pipeline. \nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time. max_eval_time_mins Any positive integer How many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to explore more complex pipelines, but will also allow TPOT to run longer. random_state Any positive integer Random number generator seed for TPOT. \nUse this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. config_dict Python dictionary Configuration dictionary for customizing the operators and parameters that TPOT uses in the optimization process. \nSee the custom configuration section for more information and examples. warm_start [True, False] Flag indicating whether the TPOT instance will reuse the population from previous calls to fit(). verbosity {0, 1, 2, 3} How much information TPOT communicates while it's running. \n0 = none, 1 = minimal, 2 = high, 3 = all. \nA setting of 2 or higher will add a progress bar during the optimization procedure. disable_update_check [True, False] Flag indicating whether the TPOT version checker should be disabled. Some example code with custom TPOT parameters might look like: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2) Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the fit function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes) The fit() function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algorithm to find the best pipeline based on average k-fold score. 
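As a hedged aside from us (not a snippet from the original TPOT docs; the parameter values are assumed), the n_jobs and warm_start parameters described in the table above can be combined so that a second fit() call resumes the search instead of restarting it: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\n# n_jobs=-1 evaluates pipelines on all available CPU cores, and\n# warm_start=True reuses the evolved population from the previous\n# call to fit() rather than starting from scratch.\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n n_jobs=-1, warm_start=True, verbosity=2)\npipeline_optimizer.fit(X_train, y_train)\npipeline_optimizer.fit(X_train, y_train) # continues the earlier optimization 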
You can then proceed to evaluate the final pipeline on the testing set with the score() function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes)) Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() function: from tpot import TPOTClassifier\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(training_features, training_classes)\nprint(pipeline_optimizer.score(testing_features, testing_classes))\npipeline_optimizer.export('tpot_exported_pipeline.py') Once this code finishes running, tpot_exported_pipeline.py will contain the Python code for the optimized pipeline. Check our examples to see TPOT applied to some specific data sets.", "title": "TPOT with code" }, { "location": "/using/#scoring-functions", - "text": "TPOT makes use of sklearn.model_selection.cross_val_score , and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT: You can pass in a string from the list described in the table above. Any other strings will cause internal issues that may break your code down the line. You can pass in a function with the signature scorer(y_true, y_pred) , where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation. def accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)", + "text": "TPOT makes use of sklearn.model_selection.cross_val_score for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT: You can pass in a string to the scoring parameter from the list above. Any other strings will cause TPOT to throw an exception. You can pass a function with the signature scorer(y_true, y_pred) , where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation. 
from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')", "title": "Scoring functions" }, { - "location": "/examples/MNIST_Example/", - "text": "Below is a minimal working example with the practice MNIST data set.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nFor details on how the \nfit()\n, \nscore()\n and \nexport()\n functions work, see the \nusage documentation\n.\n\n\nRunning this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the \ntpot_mnist_pipeline.py\n file and look similar to the following:\n\n\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.pipeline import make_pipeline\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')\nfeatures = tpot_data.view((np.float64, len(tpot_data.dtype.names)))\nfeatures = np.delete(features, tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = make_pipeline(\n KNeighborsClassifier(n_neighbors=3, weights=\"uniform\")\n)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)", - "title": "MNIST Example" + "location": "/using/#customizing-tpots-operators-and-parameters", + "text": "TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters. The custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., sklearn.naive_bayes.MultinomialNB ) and the second level key is the corresponding parameter name for that operator (e.g., fit_prior ). The second level key should point to a list of parameter values for that parameter, e.g., 'fit_prior': [True, False] . 
For a simple example, the configuration could be: classifier_config_dict = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n} in which case TPOT would only explore pipelines containing GaussianNB , BernoulliNB , MultinomialNB , and tune those algorithms' parameters in the ranges provided. This dictionary can be passed directly within the code to the TPOTClassifier / TPOTRegressor config_dict parameter, described above. For example: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\nclassifier_config_dict = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=classifier_config_dict)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') Command-line users must create a separate .py file with the custom configuration and provide the path to the file to the tpot call. For example, if the simple example configuration above is saved in tpot_classifier_config.py , that configuration could be used on the command line with the command: tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for classification and regression in TPOT's source code. Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. 
For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it explores.", + "title": "Customizing TPOT's operators and parameters" }, { - "location": "/examples/IRIS_Example/", - "text": "The following code illustrates the usage of TPOT with the IRIS data set.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nimport numpy as np\n\niris = load_iris()\nX_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),\n iris.target.astype(np.float64), train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_iris_pipeline.py')\n\n\n\n\nRunning this code should discover a pipeline that achieves ~96% testing accuracy.\n\n\nFor details on how the \nfit()\n, \nscore()\n and \nexport()\n functions work, see the \nusage documentation\n.\n\n\nAfter running the above code, the corresponding Python code should be exported to the \ntpot_iris_pipeline.py\n file and look similar to the following:\n\n\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import VotingClassifier\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline, make_union\nfrom sklearn.preprocessing import FunctionTransformer, PolynomialFeatures\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = make_pipeline(\n PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),\n LogisticRegression(C=0.9, dual=False, penalty=\"l2\")\n)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)", - "title": "IRIS Example" + "location": "/examples/", + "text": "Iris flower classification\n\n\nThe following code illustrates the usage of TPOT with the Iris data set, which is a simple supervised classification problem.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nimport numpy as np\n\niris = load_iris()\nX_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),\n iris.target.astype(np.float64), train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_iris_pipeline.py')\n\n\n\n\nRunning this code should discover a pipeline that achieves about 97% testing accuracy.\n\n\nFor details on how the \nfit()\n, \nscore()\n and \nexport()\n functions work, see the \nusage documentation\n.\n\n\nAfter running the above code, the corresponding Python code should be exported to the \ntpot_iris_pipeline.py\n file and look similar to the following:\n\n\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import Normalizer\n\n# NOTE: Make sure 
that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = make_pipeline(\n Normalizer(),\n GaussianNB()\n)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)\n\n\n\n\nMNIST digit recognition\n\n\nBelow is a minimal working example with the practice MNIST data set, which is an image classification problem.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nFor details on how the \nfit()\n, \nscore()\n and \nexport()\n functions work, see the \nusage documentation\n.\n\n\nRunning this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the \ntpot_mnist_pipeline.py\n file and look similar to the following:\n\n\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = KNeighborsClassifier(n_neighbors=6, weights=\"distance\")\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)\n\n\n\n\nBoston housing prices modeling\n\n\nThe following code illustrates the usage of TPOT with the Boston housing prices data set, which is a regression problem.\n\n\nfrom tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py')\n\n\n\n\nRunning this code should discover a pipeline that achieves at least 10 mean squared error (MSE) on the test set.\n\n\nFor details on how the \nfit()\n, \nscore()\n and \nexport()\n functions work, see the \nusage documentation\n.\n\n\nAfter running the above code, the corresponding Python code should be exported to the \ntpot_boston_pipeline.py\n file and look similar to the following:\n\n\nimport numpy as np\n\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.model_selection import train_test_split\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = 
np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = GradientBoostingRegressor(alpha=0.85, learning_rate=0.1, loss=\"ls\",\n max_features=0.9, min_samples_leaf=5,\n min_samples_split=6)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)\n\n\n\n\nTitanic survival analysis\n\n\nTo see TPOT applied to the Titanic Kaggle dataset, see the Jupyter notebook \nhere\n. This example shows how to take a messy dataset and preprocess it such that it can be used in scikit-learn and TPOT.",
        "title": "Examples"
    },
    {
        "location": "/examples/#iris-flower-classification",
        "text": "The following code illustrates the usage of TPOT with the Iris data set, which is a simple supervised classification problem. from tpot import TPOTClassifier\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nimport numpy as np\n\niris = load_iris()\nX_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),\n iris.target.astype(np.float64), train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_iris_pipeline.py') Running this code should discover a pipeline that achieves about 97% testing accuracy. 
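Because both the train/test split and the genetic programming search are stochastic, the exact pipeline and score will vary between runs. As a sketch from us (the seed value is our assumption, not part of the original example), fixing random_state in both places makes a run repeatable: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nimport numpy as np\n\niris = load_iris()\n# Pinning random_state fixes both the data split and the GP search,\n# so repeated runs rediscover the same pipeline and score.\nX_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),\n iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)\n\ntpot = TPOTClassifier(generations=5, population_size=50, random_state=42, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test)) 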
For details on how the fit() , score() and export() functions work, see the usage documentation . After running the above code, the corresponding Python code should be exported to the tpot_iris_pipeline.py file and look similar to the following: import numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import Normalizer\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = make_pipeline(\n Normalizer(),\n GaussianNB()\n)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)", + "title": "Iris flower classification" }, { - "location": "/examples/Titanic_Kaggle_Example/", - "text": "To see the TPOT applied the Titanic Kaggle dataset, see the Jupyter notebook \nhere\n.", - "title": "Titanic Kaggle Example" + "location": "/examples/#mnist-digit-recognition", + "text": "Below is a minimal working example with the practice MNIST data set, which is an image classification problem. from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') For details on how the fit() , score() and export() functions work, see the usage documentation . Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_mnist_pipeline.py file and look similar to the following: import numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.neighbors import KNeighborsClassifier\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = KNeighborsClassifier(n_neighbors=6, weights=\"distance\")\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)", + "title": "MNIST digit recognition" + }, + { + "location": "/examples/#boston-housing-prices-modeling", + "text": "The following code illustrates the usage of TPOT with the Boston housing prices data set, which is a regression problem. 
from tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py') Running this code should discover a pipeline that achieves at least 10 mean squared error (MSE) on the test set. For details on how the fit() , score() and export() functions work, see the usage documentation . After running the above code, the corresponding Python code should be exported to the tpot_boston_pipeline.py file and look similar to the following: import numpy as np\n\nfrom sklearn.ensemble import GradientBoostingRegressor\nfrom sklearn.model_selection import train_test_split\n\n# NOTE: Make sure that the class is labeled 'class' in the data file\ntpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)\nfeatures = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),\n tpot_data.dtype.names.index('class'), axis=1)\ntraining_features, testing_features, training_classes, testing_classes = \\\n train_test_split(features, tpot_data['class'], random_state=42)\n\nexported_pipeline = GradientBoostingRegressor(alpha=0.85, learning_rate=0.1, loss=\"ls\",\n max_features=0.9, min_samples_leaf=5,\n min_samples_split=6)\n\nexported_pipeline.fit(training_features, training_classes)\nresults = exported_pipeline.predict(testing_features)",
        "title": "Boston housing prices modeling"
    },
    {
        "location": "/examples/#titanic-survival-analysis",
        "text": "To see TPOT applied to the Titanic Kaggle dataset, see the Jupyter notebook here . This example shows how to take a messy dataset and preprocess it such that it can be used in scikit-learn and TPOT.",
        "title": "Titanic survival analysis"
    },
    {
        "location": "/contributing/",
-        "text": "We welcome you to \ncheck the existing issues\n for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please \nfile a new issue\n so we can discuss it.\n\n\nProject layout\n\n\nThe latest stable release of TPOT is on the \nmaster branch\n, whereas the latest version of TPOT in development is on the \ndevelopment branch\n. Make sure you are looking at and working on the correct branch if you're looking to contribute code.\n\n\nIn terms of directory structure:\n\n\n\n\nAll of TPOT's code sources are in the \ntpot\n directory\n\n\nThe documentation sources are in the \ndocs\n directory\n\n\nImages in the documentation are in the \nimages\n directory\n\n\nTutorials for TPOT are in the \ntutorials\n directory\n\n\nUnit tests for TPOT are in the \ntests.py\n file\n\n\n\n\nMake sure to familiarize yourself with the project layout before making any major contributions, and especially make sure to send all code changes to the \ndevelopment\n branch.\n\n\nHow to contribute\n\n\nThe preferred way to contribute to TPOT is to fork the \n\nmain repository\n on\nGitHub:\n\n\n\n\n\n\nFork the \nproject repository\n:\n click on the 'Fork' button near the top of the page. 
This creates\n a copy of the code under your account on the GitHub server.\n\n\n\n\n\n\nClone this copy to your local disk:\n\n\n $ git clone git@github.com:YourLogin/tpot.git\n $ cd tpot\n\n\n\n\n\n\n\nCreate a branch to hold your changes:\n\n\n $ git checkout -b my-contribution\n\n\n\n\n\n\n\nMake sure your local environment is setup correctly for development. Installation instructions are almost identical to \nthe user instructions\n except that TPOT should \nnot\n be installed. If you have TPOT installed on your computer then make sure you are using a virtual environment that does not have TPOT installed. Furthermore, you should make sure you have installed the \nnose\n package into your development environment so that you can test changes locally.\n\n\n $ conda install nose\n\n\n\n\n\n\n\nStart making changes on your newly created branch, remembering to never work on the \nmaster\n branch! Work on this copy on your computer using Git to do the version control.\n\n\n\n\n\n\nOnce some changes are saved locally, you can use your tweaked version of TPOT by navigating to the project's base directory and running TPOT directly from the command line:\n\n\n $ python -m tpot.driver\n\n\n\nor by running script that imports and uses the TPOT module with code similar to \nfrom tpot import TPOTClassifier\n\n\n\n\n\n\nTo check your changes haven't broken any existing tests and to check new tests you've added pass run the following (note, you must have the \nnose\n package installed within your dev environment for this to work):\n\n\n $ nosetests -s -v\n\n\n\n\n\n\n\nWhen you're done editing and local testing, run:\n\n\n $ git add modified_files\n $ git commit\n\n\n\n\n\n\n\nto record your changes in Git, then push them to GitHub with:\n\n\n $ git push -u origin my-contribution\n\n\n\nFinally, go to the web page of your fork of the TPOT repo, and click 'Pull Request' (PR) to send your changes to the maintainers for review. Make sure that you send your PR to the \ndevelopment\n branch, as the \nmaster\n branch is reserved for the latest stable release. This will start the CI server to check all the project's unit tests run and send an email to the maintainers.\n\n\n(If any of the above seems like magic to you, then look up the \n\nGit documentation\n on the web.)\n\n\nBefore submitting your pull request\n\n\nBefore you submit a pull request for your contribution, please work through this checklist to make sure that you have done everything necessary so we can efficiently review and accept your changes.\n\n\nIf your contribution changes TPOT in any way:\n\n\n\n\n\n\nUpdate the \ndocumentation\n so all of your changes are reflected there.\n\n\n\n\n\n\nUpdate the \nREADME\n if anything there has changed.\n\n\n\n\n\n\nIf your contribution involves any code changes:\n\n\n\n\n\n\nUpdate the \nproject unit tests\n to test your code changes.\n\n\n\n\n\n\nMake sure that your code is properly commented with \ndocstrings\n and comments explaining your rationale behind non-obvious coding practices.\n\n\n\n\n\n\nIf your code affected any of the pipeline operators, make sure that the corresponding \nexport functionality\n reflects those changes.\n\n\n\n\n\n\nIf your contribution requires a new library dependency:\n\n\n\n\n\n\nDouble-check that the new dependency is easy to install via \npip\n or Anaconda and supports both Python 2 and 3. 
If the dependency requires a complicated installation, then we most likely won't merge your changes because we want to keep TPOT easy to install.\n\n\n\n\n\n\nAdd the required version of the library to \n.travis.yml\n\n\n\n\n\n\nAdd a line to pip install the library to \n.travis_install.sh\n\n\n\n\n\n\nAdd a line to print the version of the library to \n.travis_install.sh\n\n\n\n\n\n\nSimilarly add a line to print the version of the library to \n.travis_test.sh\n\n\n\n\n\n\nUpdating the documentation\n\n\nWe use \nmkdocs\n to manage our \ndocumentation\n. This allows us to write the docs in Markdown and compile them to HTML as needed. Below are a few useful commands to know when updating the documentation. Make sure that you are running them in the base documentation directory, \ndocs\n.\n\n\n\n\n\n\nmkdocs serve\n: Hosts of a local version of the documentation that you can access at the provided URL. The local version will update automatically as you save changes to the documentation.\n\n\n\n\n\n\nmkdocs build --clean\n: Creates a fresh build of the documentation in HTML. Always run this before deploying the documentation to GitHub.\n\n\n\n\n\n\nmkdocs gh-deploy\n: Deploys the documentation to GitHub. If you're deploying on your fork of TPOT, the online documentation should be accessible at \nhttp://.github.io/tpot/\n. Generally, you shouldn't need to run this command because you can view your changes with \nmkdocs serve\n.\n\n\n\n\n\n\nAfter submitting your pull request\n\n\nAfter submitting your pull request, \nTravis-CI\n will automatically run unit tests on your changes and make sure that your updated code builds and runs on Python 2 and 3. We also use services that automatically check code quality and test coverage.\n\n\nCheck back shortly after submitting your pull request to make sure that your code passes these checks. If any of the checks come back with a red X, then do your best to address the errors.",
+        "text": "We welcome you to \ncheck the existing issues\n for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please \nfile a new issue\n so we can discuss it.\n\n\nProject layout\n\n\nThe latest stable release of TPOT is on the \nmaster branch\n, whereas the latest version of TPOT in development is on the \ndevelopment branch\n. Make sure you are looking at and working on the correct branch if you're looking to contribute code.\n\n\nIn terms of directory structure:\n\n\n\n\nAll of TPOT's code sources are in the \ntpot\n directory\n\n\nThe documentation sources are in the \ndocs_sources\n directory\n\n\nImages in the documentation are in the \nimages\n directory\n\n\nTutorials for TPOT are in the \ntutorials\n directory\n\n\nUnit tests for TPOT are in the \ntests.py\n file\n\n\n\n\nMake sure to familiarize yourself with the project layout before making any major contributions, and especially make sure to send all code changes to the \ndevelopment\n branch.\n\n\nHow to contribute\n\n\nThe preferred way to contribute to TPOT is to fork the \n\nmain repository\n on\nGitHub:\n\n\n\n\n\n\nFork the \nproject repository\n:\n click on the 'Fork' button near the top of the page. This creates\n a copy of the code under your account on the GitHub server.\n\n\n\n\n\n\nClone this copy to your local disk:\n\n\n $ git clone git@github.com:YourUsername/tpot.git\n $ cd tpot\n\n\n\n\n\n\n\nCreate a branch to hold your changes:\n\n\n $ git checkout -b my-contribution\n\n\n\n\n\n\n\nMake sure your local environment is set up correctly for development. 
Installation instructions are almost identical to \nthe user instructions\n except that TPOT should \nnot\n be installed. If you have TPOT installed on your computer then make sure you are using a virtual environment that does not have TPOT installed. Furthermore, you should make sure you have installed the \nnose\n package into your development environment so that you can test changes locally.\n\n\n $ conda install nose\n\n\n\n\n\n\n\nStart making changes on your newly created branch, remembering to never work on the \nmaster\n branch! Work on this copy on your computer using Git to do the version control.\n\n\n\n\n\n\nOnce some changes are saved locally, you can use your tweaked version of TPOT by navigating to the project's base directory and running TPOT directly from the command line:\n\n\n $ python -m tpot.driver\n\n\n\nor by running a script that imports and uses the TPOT module with code similar to \nfrom tpot import TPOTClassifier\n\n\n\n\n\n\nTo check that your changes haven't broken any existing tests and that any new tests you've added pass, run the following (note, you must have the \nnose\n package installed within your dev environment for this to work):\n\n\n $ nosetests -s -v\n\n\n\n\n\n\n\nWhen you're done editing and local testing, run:\n\n\n $ git add modified_files\n $ git commit\n\n\n\n\n\n\n\nto record your changes in Git, then push them to GitHub with:\n\n\n $ git push -u origin my-contribution\n\n\n\nFinally, go to the web page of your fork of the TPOT repo, and click 'Pull Request' (PR) to send your changes to the maintainers for review. Make sure that you send your PR to the \ndevelopment\n branch, as the \nmaster\n branch is reserved for the latest stable release. This will start the CI server to check that all the project's unit tests run and send an email to the maintainers.\n\n\n(If any of the above seems like magic to you, then look up the \n\nGit documentation\n on the web.)\n\n\nBefore submitting your pull request\n\n\nBefore you submit a pull request for your contribution, please work through this checklist to make sure that you have done everything necessary so we can efficiently review and accept your changes.\n\n\nIf your contribution changes TPOT in any way:\n\n\n\n\n\n\nUpdate the \ndocumentation\n so all of your changes are reflected there.\n\n\n\n\n\n\nUpdate the \nREADME\n if anything there has changed.\n\n\n\n\n\n\nIf your contribution involves any code changes:\n\n\n\n\n\n\nUpdate the \nproject unit tests\n to test your code changes.\n\n\n\n\n\n\nMake sure that your code is properly commented with \ndocstrings\n and comments explaining your rationale behind non-obvious coding practices.\n\n\n\n\n\n\nIf your code affected any of the pipeline operators, make sure that the corresponding \nexport functionality\n reflects those changes.\n\n\n\n\n\n\nIf your contribution requires a new library dependency:\n\n\n\n\n\n\nDouble-check that the new dependency is easy to install via \npip\n or Anaconda and supports both Python 2 and 3. 
If the dependency requires a complicated installation, then we most likely won't merge your changes because we want to keep TPOT easy to install.\n\n\n\n\n\n\nAdd the required version of the library to \n.travis.yml\n\n\n\n\n\n\nAdd a line to pip install the library to \n.travis_install.sh\n\n\n\n\n\n\nAdd a line to print the version of the library to \n.travis_install.sh\n\n\n\n\n\n\nSimilarly add a line to print the version of the library to \n.travis_test.sh\n\n\n\n\n\n\nUpdating the documentation\n\n\nWe use \nmkdocs\n to manage our \nproject documentation\n. This allows us to write the documentation in Markdown and compile it to HTML as needed. Below are a couple of useful commands to know when updating the documentation. Make sure that you are running these commands in the base directory of the TPOT project.\n\n\n\n\n\n\nmkdocs serve\n: Hosts a local version of the documentation that you can access at the provided URL. The local version will update automatically as you save changes to the documentation.\n\n\n\n\n\n\nmkdocs build --clean\n: Creates a fresh build of the documentation in HTML in the \ndocs\n directory. Always run this before pushing the documentation to GitHub.\n\n\n\n\n\n\nAfter submitting your pull request\n\n\nAfter submitting your pull request, \nTravis-CI\n will automatically run unit tests on your changes and make sure that your updated code builds and runs on Python 2 and 3. We also use services that automatically check code quality and test coverage.\n\n\nCheck back shortly after submitting your pull request to make sure that your code passes these checks. If any of the checks come back with a red X, then do your best to address the errors.",
        "title": "Contributing"
    },
    {
        "location": "/contributing/#project-layout",
-        "text": "The latest stable release of TPOT is on the master branch , whereas the latest version of TPOT in development is on the development branch . Make sure you are looking at and working on the correct branch if you're looking to contribute code. In terms of directory structure: All of TPOT's code sources are in the tpot directory The documentation sources are in the docs directory Images in the documentation are in the images directory Tutorials for TPOT are in the tutorials directory Unit tests for TPOT are in the tests.py file Make sure to familiarize yourself with the project layout before making any major contributions, and especially make sure to send all code changes to the development branch.",
+        "text": "The latest stable release of TPOT is on the master branch , whereas the latest version of TPOT in development is on the development branch . Make sure you are looking at and working on the correct branch if you're looking to contribute code. In terms of directory structure: All of TPOT's code sources are in the tpot directory The documentation sources are in the docs_sources directory Images in the documentation are in the images directory Tutorials for TPOT are in the tutorials directory Unit tests for TPOT are in the tests.py file Make sure to familiarize yourself with the project layout before making any major contributions, and especially make sure to send all code changes to the development branch.",
        "title": "Project layout"
    },
    {
        "location": "/contributing/#how-to-contribute",
-        "text": "The preferred way to contribute to TPOT is to fork the main repository on\nGitHub: Fork the project repository :\n click on the 'Fork' button near the top of the page. This creates\n a copy of the code under your account on the GitHub server. 
Clone this copy to your local disk: $ git clone git@github.com:YourLogin/tpot.git\n $ cd tpot Create a branch to hold your changes: $ git checkout -b my-contribution Make sure your local environment is setup correctly for development. Installation instructions are almost identical to the user instructions except that TPOT should not be installed. If you have TPOT installed on your computer then make sure you are using a virtual environment that does not have TPOT installed. Furthermore, you should make sure you have installed the nose package into your development environment so that you can test changes locally. $ conda install nose Start making changes on your newly created branch, remembering to never work on the master branch! Work on this copy on your computer using Git to do the version control. Once some changes are saved locally, you can use your tweaked version of TPOT by navigating to the project's base directory and running TPOT directly from the command line: $ python -m tpot.driver or by running script that imports and uses the TPOT module with code similar to from tpot import TPOTClassifier To check your changes haven't broken any existing tests and to check new tests you've added pass run the following (note, you must have the nose package installed within your dev environment for this to work): $ nosetests -s -v When you're done editing and local testing, run: $ git add modified_files\n $ git commit to record your changes in Git, then push them to GitHub with: $ git push -u origin my-contribution Finally, go to the web page of your fork of the TPOT repo, and click 'Pull Request' (PR) to send your changes to the maintainers for review. Make sure that you send your PR to the development branch, as the master branch is reserved for the latest stable release. This will start the CI server to check all the project's unit tests run and send an email to the maintainers. (If any of the above seems like magic to you, then look up the Git documentation on the web.)",
+        "text": "The preferred way to contribute to TPOT is to fork the main repository on\nGitHub: Fork the project repository :\n click on the 'Fork' button near the top of the page. This creates\n a copy of the code under your account on the GitHub server. Clone this copy to your local disk: $ git clone git@github.com:YourUsername/tpot.git\n $ cd tpot Create a branch to hold your changes: $ git checkout -b my-contribution Make sure your local environment is set up correctly for development. Installation instructions are almost identical to the user instructions except that TPOT should not be installed. If you have TPOT installed on your computer then make sure you are using a virtual environment that does not have TPOT installed. Furthermore, you should make sure you have installed the nose package into your development environment so that you can test changes locally. $ conda install nose Start making changes on your newly created branch, remembering to never work on the master branch! Work on this copy on your computer using Git to do the version control. 
Once some changes are saved locally, you can use your tweaked version of TPOT by navigating to the project's base directory and running TPOT directly from the command line: $ python -m tpot.driver or by running a script that imports and uses the TPOT module with code similar to from tpot import TPOTClassifier To check that your changes haven't broken any existing tests and that any new tests you've added pass, run the following (note, you must have the nose package installed within your dev environment for this to work): $ nosetests -s -v When you're done editing and local testing, run: $ git add modified_files\n $ git commit to record your changes in Git, then push them to GitHub with: $ git push -u origin my-contribution Finally, go to the web page of your fork of the TPOT repo, and click 'Pull Request' (PR) to send your changes to the maintainers for review. Make sure that you send your PR to the development branch, as the master branch is reserved for the latest stable release. This will start the CI server to check that all the project's unit tests run and send an email to the maintainers. (If any of the above seems like magic to you, then look up the Git documentation on the web.)",
        "title": "How to contribute"
    },
    {
@@ -72,7 +82,7 @@
 },
 {
        "location": "/contributing/#updating-the-documentation",
-        "text": "We use mkdocs to manage our documentation . This allows us to write the docs in Markdown and compile them to HTML as needed. Below are a few useful commands to know when updating the documentation. Make sure that you are running them in the base documentation directory, docs . mkdocs serve : Hosts of a local version of the documentation that you can access at the provided URL. The local version will update automatically as you save changes to the documentation. mkdocs build --clean : Creates a fresh build of the documentation in HTML. Always run this before deploying the documentation to GitHub. mkdocs gh-deploy : Deploys the documentation to GitHub. If you're deploying on your fork of TPOT, the online documentation should be accessible at http://.github.io/tpot/ . Generally, you shouldn't need to run this command because you can view your changes with mkdocs serve .",
+        "text": "We use mkdocs to manage our project documentation . This allows us to write the documentation in Markdown and compile it to HTML as needed. Below are a couple of useful commands to know when updating the documentation. Make sure that you are running these commands in the base directory of the TPOT project. mkdocs serve : Hosts a local version of the documentation that you can access at the provided URL. The local version will update automatically as you save changes to the documentation. mkdocs build --clean : Creates a fresh build of the documentation in HTML in the docs directory. Always run this before pushing the documentation to GitHub.",
        "title": "Updating the documentation"
    },
    {
@@ -82,9 +92,14 @@
 },
 {
        "location": "/releases/",
-        "text": "Version 0.6\n\n\n\n\n\n\nTPOT now supports regression problems!\n We have created two separate \nTPOTClassifier\n and \nTPOTRegressor\n classes to support classification and regression problems, respectively. 
The \ncommand-line interface\n also supports this feature through the \n-mode\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit\n for the optimization process with the \nmax_time_mins\n parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you.\n\n\n\n\n\n\nAdded a new operator that performs feature selection using \nExtraTrees\n feature importance scores.\n\n\n\n\n\n\nXGBoost\n has been added as an optional dependency to TPOT.\n If you have XGBoost installed, TPOT will automatically detect your installation and use the \nXGBoostClassifier\n and \nXGBoostRegressor\n in its pipelines.\n\n\n\n\n\n\nTPOT now offers a verbosity level of 3 (\"science mode\"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.\n\n\n\n\n\n\nVersion 0.5\n\n\n\n\nMajor refactor: Each operator is defined in a separate class file. Hooray for easier-to-maintain code!\n\n\nTPOT now \nexports directly to scikit-learn Pipelines\n instead of hacky code.\n\n\nInternal representation of individuals now uses scikit-learn pipelines.\n\n\nParameters for each operator have been optimized so TPOT spends less time exploring useless parameters.\n\n\nWe have removed pandas as a dependency and instead use numpy matrices to store the data.\n\n\nTPOT now uses \nk-fold cross-validation\n when evaluating pipelines, with a default k = 3. This k parameter can be tuned when creating a new TPOT instance.\n\n\nImproved \nscoring function support\n: Even though TPOT uses balanced accuracy by default, you can now have TPOT use \nany of the scoring functions\n that \ncross_val_score\n supports.\n\n\nAdded the scikit-learn \nNormalizer\n preprocessor.\n\n\nMinor text fixes.\n\n\n\n\nVersion 0.4\n\n\nIn TPOT 0.4, we've made some major changes to the internals of TPOT and added some convenience functions. 
We've summarized the changes below.\n\n\n\nAdded new sklearn models and preprocessors\n\n\n\n\nAdaBoostClassifier\n\n\nBernoulliNB\n\n\nExtraTreesClassifier\n\n\nGaussianNB\n\n\nMultinomialNB\n\n\nLinearSVC\n\n\nPassiveAggressiveClassifier\n\n\nGradientBoostingClassifier\n\n\nRBFSampler\n\n\nFastICA\n\n\nFeatureAgglomeration\n\n\nNystroem\n\n\n\n\nAdded operator that inserts virtual features for the count of features with values of zero\n\n\nReworked parameterization of TPOT operators\n\n\n\nReduced parameter search space with information from a scikit-learn benchmark\n\n\nTPOT no longer generates arbitrary parameter values, but uses a fixed parameter set instead\n\n\n\n\nRemoved XGBoost as a dependency\n\n\n\nToo many users were having install issues with XGBoost\n\n\nReplaced with scikit-learn's GradientBoostingClassifier\n\n\n\n\nImproved descriptiveness of TPOT command line parameter documentation\n\n\nRemoved min/max/avg details during fit() when verbosity > 1\n\n\n\n\nReplaced with tqdm progress bar\n\n\nAdded tqdm as a dependency\n\n\n\n\nAdded \nfit_predict()\n convenience function\n\n\nAdded \nget_params()\n function so TPOT can operate in scikit-learn's \ncross_val_score\n & related functions\n\n\n\n\n\nVersion 0.3\n\n\n\n\nWe revised the internal optimization process of TPOT to make it more efficient, in particular in regards to the model parameters that TPOT optimizes over.\n\n\n\n\nVersion 0.2\n\n\n\n\n\n\nTPOT now has the ability to export the optimized pipelines to sklearn code.\n\n\n\n\n\n\nLogistic regression, SVM, and k-nearest neighbors classifiers were added as pipeline operators. Previously, TPOT only included decision tree and random forest classifiers.\n\n\n\n\n\n\nTPOT can now use arbitrary scoring functions for the optimization process.\n\n\n\n\n\n\nTPOT now performs multi-objective Pareto optimization to balance model complexity (i.e., # of pipeline operators) and the score of the pipeline.\n\n\n\n\n\n\nVersion 0.1\n\n\n\n\n\n\nFirst public release of TPOT.\n\n\n\n\n\n\nOptimizes pipelines with decision trees and random forest classifiers as the model, and uses a handful of feature preprocessors.", + "text": "Version 0.7\n\n\n\n\n\n\nTPOT now has multiprocessing support (Linux and macOS only).\n TPOT allows you to use multiple processes to accelerate pipeline optimization with the \nn_jobs\n parameter in both TPOTClassifier and TPOTRegressor.\n\n\n\n\n\n\nTPOT now allows you to \ncustomize the operators and parameters explored during the optimization process.\n TPOT allows you to customize the list of operators and parameters in TPOT's optimization process with the \nconfig_dict\n parameter. The format of this customized dictionary can be found in the \nonline documentation\n.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit for evaluating a single pipeline\n (default limit is 5 minutes) during the optimization process with the \nmax_eval_time_mins\n parameter, so TPOT won't spend hours evaluating overly-complex pipelines.\n\n\n\n\n\n\nWe tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the \nmu+lambda algorithm\n. This algorithm gives you more control over how many pipelines are generated every iteration with the \noffspring_size\n parameter.\n\n\n\n\n\n\nFixed a reproducibility issue where setting \nrandom_seed\n didn't necessarily result in the same results every time. 
This bug was present since version 0.6.\n\n\n\n\n\n\nRefined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6.\n\n\n\n\n\n\nTPOT now supports sample weights in the fitness function if some of your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., \ntpot.fit(x_train, y_train, sample_weights=sample_weights)\n.\n\n\n\n\n\n\nThe default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting \nscoring='balanced_accuracy'\n when creating a TPOT instance.\n\n\n\n\n\n\nVersion 0.6\n\n\n\n\n\n\nTPOT now supports regression problems!\n We have created two separate \nTPOTClassifier\n and \nTPOTRegressor\n classes to support classification and regression problems, respectively. The \ncommand-line interface\n also supports this feature through the \n-mode\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit\n for the optimization process with the \nmax_time_mins\n parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you.\n\n\n\n\n\n\nAdded a new operator that performs feature selection using \nExtraTrees\n feature importance scores.\n\n\n\n\n\n\nXGBoost\n has been added as an optional dependency to TPOT.\n If you have XGBoost installed, TPOT will automatically detect your installation and use the \nXGBoostClassifier\n and \nXGBoostRegressor\n in its pipelines.\n\n\n\n\n\n\nTPOT now offers a verbosity level of 3 (\"science mode\"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.\n\n\n\n\n\n\nVersion 0.5\n\n\n\n\nMajor refactor: Each operator is defined in a separate class file. Hooray for easier-to-maintain code!\n\n\nTPOT now \nexports directly to scikit-learn Pipelines\n instead of hacky code.\n\n\nInternal representation of individuals now uses scikit-learn pipelines.\n\n\nParameters for each operator have been optimized so TPOT spends less time exploring useless parameters.\n\n\nWe have removed pandas as a dependency and instead use numpy matrices to store the data.\n\n\nTPOT now uses \nk-fold cross-validation\n when evaluating pipelines, with a default k = 3. This k parameter can be tuned when creating a new TPOT instance.\n\n\nImproved \nscoring function support\n: Even though TPOT uses balanced accuracy by default, you can now have TPOT use \nany of the scoring functions\n that \ncross_val_score\n supports.\n\n\nAdded the scikit-learn \nNormalizer\n preprocessor.\n\n\nMinor text fixes.\n\n\n\n\nVersion 0.4\n\n\nIn TPOT 0.4, we've made some major changes to the internals of TPOT and added some convenience functions. 
We've summarized the changes below.\n\n\n\nAdded new sklearn models and preprocessors\n\n\n\n\nAdaBoostClassifier\n\n\nBernoulliNB\n\n\nExtraTreesClassifier\n\n\nGaussianNB\n\n\nMultinomialNB\n\n\nLinearSVC\n\n\nPassiveAggressiveClassifier\n\n\nGradientBoostingClassifier\n\n\nRBFSampler\n\n\nFastICA\n\n\nFeatureAgglomeration\n\n\nNystroem\n\n\n\n\nAdded operator that inserts virtual features for the count of features with values of zero\n\n\nReworked parameterization of TPOT operators\n\n\n\nReduced parameter search space with information from a scikit-learn benchmark\n\n\nTPOT no longer generates arbitrary parameter values, but uses a fixed parameter set instead\n\n\n\n\nRemoved XGBoost as a dependency\n\n\n\nToo many users were having install issues with XGBoost\n\n\nReplaced with scikit-learn's GradientBoostingClassifier\n\n\n\n\nImproved descriptiveness of TPOT command line parameter documentation\n\n\nRemoved min/max/avg details during fit() when verbosity > 1\n\n\n\n\nReplaced with tqdm progress bar\n\n\nAdded tqdm as a dependency\n\n\n\n\nAdded \nfit_predict()\n convenience function\n\n\nAdded \nget_params()\n function so TPOT can operate in scikit-learn's \ncross_val_score\n & related functions\n\n\n\n\n\nVersion 0.3\n\n\n\n\nWe revised the internal optimization process of TPOT to make it more efficient, in particular in regards to the model parameters that TPOT optimizes over.\n\n\n\n\nVersion 0.2\n\n\n\n\n\n\nTPOT now has the ability to export the optimized pipelines to sklearn code.\n\n\n\n\n\n\nLogistic regression, SVM, and k-nearest neighbors classifiers were added as pipeline operators. Previously, TPOT only included decision tree and random forest classifiers.\n\n\n\n\n\n\nTPOT can now use arbitrary scoring functions for the optimization process.\n\n\n\n\n\n\nTPOT now performs multi-objective Pareto optimization to balance model complexity (i.e., # of pipeline operators) and the score of the pipeline.\n\n\n\n\n\n\nVersion 0.1\n\n\n\n\n\n\nFirst public release of TPOT.\n\n\n\n\n\n\nOptimizes pipelines with decision trees and random forest classifiers as the model, and uses a handful of feature preprocessors.", "title": "Release Notes" }, { "location": "/releases/#version-07", "text": "TPOT now has multiprocessing support (Linux and macOS only). TPOT allows you to use multiple processes to accelerate pipeline optimization with the n_jobs parameter in both TPOTClassifier and TPOTRegressor. TPOT now allows you to customize the operators and parameters explored during the optimization process. TPOT allows you to customize the list of operators and parameters in TPOT's optimization process with the config_dict parameter. The format of this customized dictionary can be found in the online documentation . TPOT now allows you to specify a time limit for evaluating a single pipeline (default limit is 5 minutes) during the optimization process with the max_eval_time_mins parameter, so TPOT won't spend hours evaluating overly-complex pipelines. We tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the mu+lambda algorithm . This algorithm gives you more control over how many pipelines are generated every iteration with the offspring_size parameter. Fixed a reproducibility issue where setting random_seed didn't necessarily result in the same results every time. This bug was present since version 0.6. Refined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6. 
TPOT now supports sample weights in the fitness function if some of your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., tpot.fit(x_train, y_train, sample_weights=sample_weights) . The default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting scoring='balanced_accuracy' when creating a TPOT instance.", "title": "Version 0.7" }, { "location": "/releases/#version-06", "text": "TPOT now supports regression problems! We have created two separate TPOTClassifier and TPOTRegressor classes to support classification and regression problems, respectively. The command-line interface also supports this feature through the -mode parameter. TPOT now allows you to specify a time limit for the optimization process with the max_time_mins parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you. Added a new operator that performs feature selection using ExtraTrees feature importance scores. XGBoost has been added as an optional dependency to TPOT. If you have XGBoost installed, TPOT will automatically detect your installation and use the XGBoostClassifier and XGBoostRegressor in its pipelines. TPOT now offers a verbosity level of 3 ("science mode"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.", diff --git a/docs/releases/index.html b/docs/releases/index.html index cafbe63a..bd19a426 100644 --- a/docs/releases/index.html +++ b/docs/releases/index.html @@ -45,113 +45,71 @@ @@ -188,7 +146,34 @@
-

Version 0.6

+

Version 0.7

+
    +
  • +

    TPOT now has multiprocessing support (Linux and macOS only). TPOT allows you to use multiple processes to accelerate pipeline optimization with the n_jobs parameter in both TPOTClassifier and TPOTRegressor (a short sketch of the new 0.7 options follows this list).

    +
  • +
  • +

    TPOT now allows you to customize the operators and parameters explored during the optimization process. TPOT allows you to customize the list of operators and parameters in TPOT's optimization process with the config_dict parameter. The format of this customized dictionary can be found in the online documentation.

    +
  • +
  • +

    TPOT now allows you to specify a time limit for evaluating a single pipeline (default limit is 5 minutes) during the optimization process with the max_eval_time_mins parameter, so TPOT won't spend hours evaluating overly-complex pipelines.

    +
  • +
  • +

    We tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the mu+lambda algorithm. This algorithm gives you more control over how many pipelines are generated every iteration with the offspring_size parameter.

    +
  • +
  • +

    Fixed a reproducibility issue where setting random_seed didn't necessarily result in the same results every time. This bug was present since version 0.6.

    +
  • +
  • +

    Refined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6.

    +
  • +
  • +

    TPOT now supports sample weights in the fitness function if some of your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., tpot.fit(x_train, y_train, sample_weights=sample_weights).

    +
  • +
  • +

    The default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting scoring='balanced_accuracy' when creating a TPOT instance.

    +
  • +
+
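Pulling the 0.7 changes together, here is a minimal sketch of how these options combine, assuming the TPOTClassifier parameters described in the documentation below; the digits data set and every parameter value in this sketch are illustrative choices, not values taken from the release notes:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20,
                      offspring_size=10,            # mu+lambda: pipelines produced per generation
                      n_jobs=-1,                    # multiprocessing (Linux and macOS only)
                      max_eval_time_mins=5,         # per-pipeline evaluation limit (the 0.7 default)
                      scoring='balanced_accuracy',  # restores the pre-0.7 default metric
                      verbosity=2)

# TPOT evaluates population_size + generations x offspring_size pipelines,
# so this run scores 20 + 5 * 10 = 70 pipelines in total.
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))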

Version 0.6

  • TPOT now supports regression problems! We have created two separate TPOTClassifier and TPOTRegressor classes to support classification and regression problems, respectively. The command-line interface also supports this feature through the -mode parameter.

    @@ -307,7 +292,7 @@

    Version 0.1

    -

    Copyright © 2016-Present Randal S. Olson

    +

    Copyright © 2015-Present Randal S. Olson

    @@ -324,7 +309,7 @@

    Version 0.1

    - GitHub + GitHub « Previous diff --git a/docs/search.html b/docs/search.html index c72c7332..ac514832 100644 --- a/docs/search.html +++ b/docs/search.html @@ -41,91 +41,47 @@ @@ -180,7 +136,7 @@

    Search Results

    -

    Copyright © 2016-Present Randal S. Olson

    +

    Copyright © 2015-Present Randal S. Olson

    @@ -197,7 +153,7 @@

    Search Results

    - GitHub + GitHub diff --git a/docs/sitemap.xml b/docs/sitemap.xml index e1135dd1..59705cad 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -4,7 +4,7 @@ http://rhiever.github.io/tpot/ - 2017-01-17 + 2017-03-22 daily @@ -12,7 +12,7 @@ http://rhiever.github.io/tpot/installing/ - 2017-01-17 + 2017-03-22 daily @@ -20,43 +20,22 @@ http://rhiever.github.io/tpot/using/ - 2017-01-17 + 2017-03-22 daily - - http://rhiever.github.io/tpot/examples/MNIST_Example/ - 2017-01-17 + http://rhiever.github.io/tpot/examples/ + 2017-03-22 daily - - - http://rhiever.github.io/tpot/examples/IRIS_Example/ - 2017-01-17 - daily - - - - http://rhiever.github.io/tpot/examples/Boston_Example/ - 2017-01-17 - daily - - - - http://rhiever.github.io/tpot/examples/Titanic_Kaggle_Example/ - 2017-01-17 - daily - - - http://rhiever.github.io/tpot/contributing/ - 2017-01-17 + 2017-03-22 daily @@ -64,7 +43,7 @@ http://rhiever.github.io/tpot/releases/ - 2017-01-17 + 2017-03-22 daily @@ -72,7 +51,7 @@ http://rhiever.github.io/tpot/citing/ - 2017-01-17 + 2017-03-22 daily @@ -80,9 +59,9 @@ http://rhiever.github.io/tpot/support/ - 2017-01-17 + 2017-03-22 daily - \ No newline at end of file + diff --git a/docs/support/index.html b/docs/support/index.html index 5d97923e..dd5185c1 100644 --- a/docs/support/index.html +++ b/docs/support/index.html @@ -45,95 +45,50 @@ @@ -190,7 +145,7 @@
    -

    Copyright © 2016-Present Randal S. Olson

    +

    Copyright © 2015-Present Randal S. Olson

    @@ -207,7 +162,7 @@
    - GitHub + GitHub « Previous diff --git a/docs/using/index.html b/docs/using/index.html index d99f035b..2db911ce 100644 --- a/docs/using/index.html +++ b/docs/using/index.html @@ -45,103 +45,64 @@ @@ -207,7 +168,7 @@

    TPOT on the command line

    -mode TPOT_MODE ['classification', 'regression'] -Whether TPOT is being used for a classification or regression problem. +Whether TPOT is being used for a supervised classification or regression problem. -o @@ -219,75 +180,118 @@

    TPOT on the command line

    -g GENERATIONS Any positive integer -Number of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total. +Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. +

    +TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -p POPULATION_SIZE Any positive integer -Number of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate GENERATIONS x POPULATION_SIZE number of pipelines in total. +Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. +

    +TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. + + +-os +OFFSPRING_SIZE +Any positive integer +Number of offspring to produce in each GP generation. +

    +By default, OFFSPRING_SIZE = POPULATION_SIZE. -mr MUTATION_RATE [0.0, 1.0] -GP mutation rate. We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. +GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation. +

    +We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. -xr CROSSOVER_RATE [0.0, 1.0] -GP crossover rate in the range [0.0, 1.0]. We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. +GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to "breed" every generation. +

    +We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. + + +-scoring +SCORING_FN +'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',
    'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' +Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. +

    +TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized. +

    +See the section on scoring functions for more details. -cv NUM_CV_FOLDS -Any integer >2 -The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process. +Any integer >1 +Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. --scoring -SCORING_FN -'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' -Function used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on scoring functions for more details. +-njobs +NUM_JOBS +Any positive integer or -1 +Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. +

    +Assigning this to -1 will use as many cores as available on the computer. -maxtime MAX_TIME_MINS Any positive integer -How many minutes TPOT has to optimize the pipeline. This setting will override the GENERATIONS parameter and allow TPOT to run until it runs out of time. +How many minutes TPOT has to optimize the pipeline. +

    +If provided, this setting will override the "generations" parameter and allow TPOT to run until it runs out of time. -maxeval MAX_EVAL_MINS Any positive integer -How many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. +How many minutes TPOT has to evaluate a single pipeline. +

    +Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. -s RANDOM_STATE Any positive integer -Random number generator seed for reproducibility. Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. +Random number generator seed for reproducibility. +

    +Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. + + +-config +CONFIG_FILE +String path to a file +Configuration file for customizing the operators and parameters that TPOT uses in the optimization process. +

    +See the custom configuration section for more information and examples. -v VERBOSITY {0, 1, 2, 3} -How much information TPOT communicates while it is running: 0 = none, 1 = minimal, 2 = all. A setting of 2 or higher will add a progress bar during the optimization procedure. +How much information TPOT communicates while it is running. +

    +0 = none, 1 = minimal, 2 = high, 3 = all. +

    +A setting of 2 or higher will add a progress bar during the optimization procedure. ---no-update-check -N/A +--no-update-check Flag indicating whether the TPOT version checker should be disabled. ---version -N/A +--version Show TPOT's version number and exit. ---help -N/A +--help Show TPOT's help documentation and exit. @@ -317,54 +321,104 @@

    TPOT with code

    Effect -generation +generations Any positive integer -The number of generations to run pipeline optimization over. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize over. TPOT will evaluate generations x population_size number of pipelines in total. +Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. +

    +TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. population_size Any positive integer -The number of individuals in the GP population. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize over. TPOT will evaluate generations x population_size number of pipelines in total. +Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. +

    +TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. + + +offspring_size +Any positive integer +Number of offspring to produce in each GP generation. +

    +By default, offspring_size = population_size. mutation_rate [0.0, 1.0] -The mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to apply random changes to every generation. We don't recommend that you tweak this parameter unless you know what you're doing. +Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. +

    +We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate [0.0, 1.0] -The crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This tells the genetic programming algorithm how many pipelines to "breed" every generation. We don't recommend that you tweak this parameter unless you know what you're doing. +Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to "breed" every generation. +

    +We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. -num_cv_folds -[2, 10] -The number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT pipeline optimization process. +scoring +'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',
    'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature scorer(y_true, y_pred) +Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. +

    +TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized. +

    +See the section on scoring functions for more details. -scoring -'accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' or a callable function with signature scorer(y_true, y_pred) -Function used to evaluate the quality of a given pipeline for the problem. By default, balanced accuracy is used for classification and mean squared error is used for regression. TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other functions will be maximized. See the section on scoring functions for more details. +cv +Any integer >1 +Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. + + +n_jobs +Any positive integer or -1 +Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. +

    +Assigning this to -1 will use as many cores as available on the computer. max_time_mins Any positive integer -How many minutes TPOT has to optimize the pipeline. This setting will override the generations parameter. +How many minutes TPOT has to optimize the pipeline. +

    +If provided, this setting will override the "generations" parameter and allow TPOT to run until it runs out of time. max_eval_time_mins Any positive integer -How many minutes TPOT has to optimize a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines but will also allow TPOT to run longer. +How many minutes TPOT has to optimize a single pipeline. +

    +Setting this parameter to higher values will allow TPOT to explore more complex pipelines, but will also allow TPOT to run longer. random_state Any positive integer -The random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. +Random number generator seed for TPOT. +

    +Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. + + +config_dict +Python dictionary +Configuration dictionary for customizing the operators and parameters that TPOT uses in the optimization process. +

    +See the custom configuration section for more information and examples. + + + + +warm_start +[True, False] +Flag indicating whether the TPOT instance will reuse the population from previous calls to fit(). verbosity {0, 1, 2, 3} -How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar to calls to fit(). +How much information TPOT communicates while it's running. +

    +0 = none, 1 = minimal, 2 = high, 3 = all. +

    +A setting of 2 or higher will add a progress bar during the optimization procedure. disable_update_check @@ -376,13 +430,15 @@

    TPOT with code

    Some example code with custom TPOT parameters might look like:

    from tpot import TPOTClassifier
     
    -pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)
    +pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
    +                                    random_state=42, verbosity=2)
     

    Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the fit function:

    from tpot import TPOTClassifier
     
    -pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)
    +pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
    +                                    random_state=42, verbosity=2)
     pipeline_optimizer.fit(training_features, training_classes)
     
    @@ -390,7 +446,8 @@

    TPOT with code

    You can then proceed to evaluate the final pipeline on the testing set with the score() function:

    from tpot import TPOTClassifier
     
    -pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)
    +pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
    +                                    random_state=42, verbosity=2)
     pipeline_optimizer.fit(training_features, training_classes)
     print(pipeline_optimizer.score(testing_features, testing_classes))
     
    @@ -398,28 +455,98 @@

    TPOT with code

    Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() function:

    from tpot import TPOTClassifier
     
    -pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, num_cv_folds=5, random_state=42, verbosity=2)
    +pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
    +                                    random_state=42, verbosity=2)
     pipeline_optimizer.fit(training_features, training_classes)
     print(pipeline_optimizer.score(testing_features, testing_classes))
     pipeline_optimizer.export('tpot_exported_pipeline.py')
     

    Once this code finishes running, tpot_exported_pipeline.py will contain the Python code for the optimized pipeline.
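The parameter table above also lists a warm_start flag. Below is a minimal sketch of how it could be used, assuming the flag reuses the evolved population across fit() calls as the table describes; that a second call continues the same optimization rather than restarting it is our reading, not something the documentation spells out:

from tpot import TPOTClassifier

# warm_start=True asks TPOT to reuse the population from previous calls to fit()
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2,
                                    warm_start=True)
pipeline_optimizer.fit(training_features, training_classes)

# A second call picks up the existing population instead of starting from scratch
pipeline_optimizer.fit(training_features, training_classes)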

    -

    Check our examples to see TPOT applied to some specific data sets.

    +

    Check our examples to see TPOT applied to some specific data sets.

    Scoring functions

    -

    TPOT makes use of sklearn.model_selection.cross_val_score, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:

    +

    TPOT makes use of sklearn.model_selection.cross_val_score for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:

    1. -

      You can pass in a string from the list described in the table above. Any other strings will cause internal issues that may break your code down the line.

      +

      You can pass in a string to the scoring parameter from the list above. Any other strings will cause TPOT to throw an exception. A short sketch of this route follows the example below.

    2. -

      You can pass in a function with the signature scorer(y_true, y_pred), where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.

      +

      You can pass a function with the signature scorer(y_true, y_pred), where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.

    -
    def accuracy(y_true, y_pred):
    +
    from tpot import TPOTClassifier
    +from sklearn.datasets import load_digits
    +from sklearn.model_selection import train_test_split
    +
    +digits = load_digits()
    +X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
    +                                                    train_size=0.75, test_size=0.25)
    +
    +def accuracy(y_true, y_pred):
         return float(sum(y_pred == y_true)) / len(y_true)
    +
    +tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
    +                      scoring=accuracy)
    +tpot.fit(X_train, y_train)
    +print(tpot.score(X_test, y_test))
    +tpot.export('tpot_mnist_pipeline.py')
    +
    + +
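For completeness, here is a minimal sketch of the first route as well, together with a custom function whose name opts in to the minimization rule quoted above; classification_error is an illustrative name and function of our own, not part of TPOT:

from tpot import TPOTClassifier

# Route 1: pass one of the metric strings from the parameter tables above
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      scoring='balanced_accuracy')

# Route 2: pass a callable with signature scorer(y_true, y_pred); per the note
# above, the "error" in its name tells TPOT this score should be minimized
def classification_error(y_true, y_pred):
    return float(sum(y_pred != y_true)) / len(y_true)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      scoring=classification_error)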

    +

    Customizing TPOT's operators and parameters

    +

    TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.

    +

    The custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., sklearn.naive_bayes.MultinomialNB) and the second level key is the corresponding parameter name for that operator (e.g., fit_prior). The second level key should point to a list of parameter values for that parameter, e.g., 'fit_prior': [True, False].

    +

    For a simple example, the configuration could be:

    +
    classifier_config_dict = {
    +    'sklearn.naive_bayes.GaussianNB': {
    +    },
    +    'sklearn.naive_bayes.BernoulliNB': {
    +        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    +        'fit_prior': [True, False]
    +    },
    +    'sklearn.naive_bayes.MultinomialNB': {
    +        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    +        'fit_prior': [True, False]
    +    }
    +}
    +
    + +

    in which case TPOT would only explore pipelines containing GaussianNB, BernoulliNB, MultinomialNB, and tune those algorithms' parameters in the ranges provided. This dictionary can be passed directly within the code to the TPOTClassifier/TPOTRegressor config_dict parameter, described above. For example: +

    +
    from tpot import TPOTClassifier
    +from sklearn.datasets import load_digits
    +from sklearn.model_selection import train_test_split
    +
    +digits = load_digits()
    +X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
    +                                                    train_size=0.75, test_size=0.25)
    +
    +classifier_config_dict = {
    +    'sklearn.naive_bayes.GaussianNB': {
    +    },
    +    'sklearn.naive_bayes.BernoulliNB': {
    +        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    +        'fit_prior': [True, False]
    +    },
    +    'sklearn.naive_bayes.MultinomialNB': {
    +        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    +        'fit_prior': [True, False]
    +    }
    +}
    +
    +tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
    +                      config_dict=classifier_config_dict)
    +tpot.fit(X_train, y_train)
    +print(tpot.score(X_test, y_test))
    +tpot.export('tpot_mnist_pipeline.py')
     
    + +

    Command-line users must create a separate .py file with the custom configuration and provide the path to the file to the tpot call. For example, if the simple example configuration above is saved in tpot_classifier_config.py, that configuration could be used on the command line with the command:

    +
    tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py
    +
    + +

    For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for classification and regression in TPOT's source code.

    +

    Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import or use XGBoost in the pipelines it explores.
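As a concrete illustration of that note, a custom configuration can reference an optional package such as XGBoost. In the sketch below, the 'xgboost.XGBClassifier' path and the parameter ranges are our own illustrative assumptions rather than values taken from TPOT's default configuration, and the entry is only usable when XGBoost is actually installed:

classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    # Hypothetical entry for an optional dependency: TPOT can only use this
    # operator if the xgboost package is installed on the machine
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': [3, 5, 10],
        'learning_rate': [1e-2, 1e-1, 0.5]
    }
}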

    @@ -427,7 +554,7 @@

    Scoring functions