
Commit

Merge remote-tracking branch 'upstream/development' into unit_tests_for_input_check
Weixuan Fu committed Mar 22, 2017
2 parents f48b31c + 0920fc9 commit b49f1ac
Showing 4 changed files with 165 additions and 54 deletions.
2 changes: 1 addition & 1 deletion docs/index.html
@@ -245,5 +245,5 @@

<!--
MkDocs version : 0.16.0
Build Date UTC : 2017-03-22 15:56:01
Build Date UTC : 2017-03-22 18:39:33
-->
13 changes: 9 additions & 4 deletions docs/mkdocs/search_index.json

Large diffs are not rendered by default.

123 changes: 85 additions & 38 deletions docs/using/index.html
@@ -73,6 +73,8 @@

<li><a class="toctree-l4" href="#scoring-functions">Scoring functions</a></li>

<li><a class="toctree-l4" href="#customizing-tpots-operators-and-parameters">Customizing TPOT's operators and parameters</a></li>


</ul>

@@ -285,22 +287,7 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
<td>-config</td>
<td>CONFIG_FILE</td>
<td>String path to a file</td>
<td>Configuration file for customizing the operators and parameters that TPOT uses in the optimization process. For example, the configuration file's format could be like:
<pre lang="nemerle">
classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}
</pre>
</td>
<td>Configuration file for customizing the operators and parameters that TPOT uses in the optimization process. See the <a href="#customconfig">custom configuration</a> section for more information and examples.</td>
</tr>
<tr>
<td>-v</td>
@@ -407,20 +394,7 @@ <h1 id="tpot-with-code">TPOT with code</h1>
<tr>
<td>config_dict</td>
<td>Python dictionary</td>
<td>Configuration dictionary for customizing the operators and parameters that TPOT uses in the optimization process. For example:
<pre lang="nemerle">
classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}
</pre>
</td>
<td>Configuration dictionary for customizing the operators and parameters that TPOT uses in the optimization process. See the <a href="#customconfig">custom configuration</a> section for more information and examples.</td>
</tr>
@@ -444,29 +418,33 @@ <h1 id="tpot-with-code">TPOT with code</h1>
<p>Some example code with custom TPOT parameters might look like:</p>
<pre><code class="Python">from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
</code></pre>

<p>Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the <code>fit</code> function:</p>
<pre><code class="Python">from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
</code></pre>

<p>The <code>fit()</code> function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algorithm to find the best pipeline based on the average k-fold score.</p>
<p>You can then proceed to evaluate the final pipeline on the testing set with the <code>score()</code> function:</p>
<pre><code class="Python">from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))
</code></pre>

<p>Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the <code>export()</code> function:</p>
<pre><code class="Python">from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))
pipeline_optimizer.export('tpot_exported_pipeline.py')
@@ -476,18 +454,87 @@ <h1 id="tpot-with-code">TPOT with code</h1>
<p>Check our <a href="../examples/MNIST_Example/">examples</a> to see TPOT applied to some specific data sets.</p>
<p><a name="scoringfunctions"></a></p>
<h2 id="scoring-functions">Scoring functions</h2>
<p>TPOT makes use of <code>sklearn.model_selection.cross_val_score</code>, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:</p>
<p>TPOT makes use of <code>sklearn.model_selection.cross_val_score</code> for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:</p>
<ol>
<li>
<p>You can pass in a string from the list described in the table above. Any other strings will cause internal issues that may break your code down the line.</p>
<p>You can pass in a string to the <code>scoring</code> parameter from the list above. Any other strings will cause TPOT to throw an exception.</p>
</li>
<li>
<p>You can pass in a function with the signature <code>scorer(y_true, y_pred)</code>, where <code>y_true</code> are the true target values and <code>y_pred</code> are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.</p>
<p>You can pass a function with the signature <code>scorer(y_true, y_pred)</code>, where <code>y_true</code> are the true target values and <code>y_pred</code> are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.</p>
</li>
</ol>
<pre><code class="Python">def accuracy(y_true, y_pred):
<pre><code class="Python">from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

def accuracy(y_true, y_pred):
    return float(sum(y_pred == y_true)) / len(y_true)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      scoring=accuracy)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
</code></pre>

<p><a name="customconfig"></a></p>
<h2 id="customizing-tpots-operators-and-parameters">Customizing TPOT's operators and parameters</h2>
<p>TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.</p>
<p>The custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., <code>sklearn.naive_bayes.MultinomialNB</code>) and the second level key is the corresponding parameter name for that operator (e.g., <code>fit_prior</code>). The second level key should point to a list of parameter values for that parameter, e.g., <code>'fit_prior': [True, False]</code>.</p>
<p>For a simple example, the configuration could be:</p>
<pre><code class="Python">classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}
</code></pre>

<p>in which case TPOT would only explore pipelines containing <code>GaussianNB</code>, <code>BernoulliNB</code>, and <code>MultinomialNB</code>, and would tune those algorithms' parameters over the ranges provided. This dictionary can be passed directly within the code to the <code>TPOTClassifier</code>/<code>TPOTRegressor</code> <code>config_dict</code> parameter, described above. For example:</p>
<pre><code class="Python">from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict=classifier_config_dict)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
</code></pre>

<p>Command-line users must create a separate <code>.py</code> file with the custom configuration and provide the path to the file to the <code>tpot</code> call. For example, if the simple example configuration above is saved in <code>tpot_classifier_config.py</code>, that configuration could be used on the command line with the command:</p>
<pre><code>tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py
</code></pre>

<p>For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for <a href="https://github.com/rhiever/tpot/blob/master/tpot/config_classifier.py">classification</a> and <a href="https://github.com/rhiever/tpot/blob/master/tpot/config_regressor.py">regression</a> in TPOT's source code.</p>
<p>Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, TPOT will simply not import or use XGBoost in the pipelines it explores.</p>

</div>
</div>
81 changes: 70 additions & 11 deletions docs_sources/using.md
@@ -252,15 +252,17 @@ Some example code with custom TPOT parameters might look like:
```Python
from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
```

Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the `fit` function:

```Python
from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
```

@@ -271,7 +273,8 @@ You can then proceed to evaluate the final pipeline on the testing set with the
```Python
from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))
```
@@ -281,7 +284,8 @@ Finally, you can tell TPOT to export the corresponding Python code for the optim
```Python
from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))
pipeline_optimizer.export('tpot_exported_pipeline.py')
@@ -294,25 +298,41 @@ Check our [examples](examples/MNIST_Example/) to see TPOT applied to some specif
<a name="scoringfunctions"></a>
## Scoring functions

TPOT makes use of `sklearn.model_selection.cross_val_score`, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:
TPOT makes use of `sklearn.model_selection.cross_val_score` for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:

1. You can pass in a string from the list described in the table above. Any other strings will cause internal issues that may break your code down the line.
1. You can pass in a string to the `scoring` parameter from the list above. Any other strings will cause TPOT to throw an exception.

2. You can pass in a function with the signature `scorer(y_true, y_pred)`, where `y_true` are the true target values and `y_pred` are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.
2. You can pass a function with the signature `scorer(y_true, y_pred)`, where `y_true` are the true target values and `y_pred` are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.

```Python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

def accuracy(y_true, y_pred):
    return float(sum(y_pred == y_true)) / len(y_true)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      scoring=accuracy)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
```
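For the first option, the accepted strings are sklearn's scorer names. As a rough sketch (assuming sklearn's `get_scorer` registry, which is what `cross_val_score` ultimately consults — the helper below is illustrative, not part of TPOT's API), you can verify a scoring string yourself before handing it to TPOT:

```Python
from sklearn.metrics import get_scorer

def is_valid_scoring_string(name):
    """Return True if `name` is a scorer string sklearn would accept.

    Illustrative helper only -- TPOT performs its own check internally.
    """
    try:
        get_scorer(name)
        return True
    except ValueError:
        return False

print(is_valid_scoring_string('accuracy'))  # a built-in scorer name
print(is_valid_scoring_string('acuracy'))   # a typo: TPOT would raise for this
```

Checking the string up front turns a failure deep inside the optimization loop into an immediate, obvious error.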

<a name="customconfig"></a>
## Customizing TPOT's operators and parameters

TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, sometimes it's useful to limit the algorithms and parameters that TPOT explores
TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.

For example, the configuration file's format could be like:
The custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., `sklearn.naive_bayes.MultinomialNB`) and the second level key is the corresponding parameter name for that operator (e.g., `fit_prior`). The second level key should point to a list of parameter values for that parameter, e.g., `'fit_prior': [True, False]`.

<pre lang="nemerle">
For a simple example, the configuration could be:

```Python
classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
@@ -325,6 +345,45 @@ classifier_config_dict = {
        'fit_prior': [True, False]
    }
}
</pre>
```

in which case TPOT would only explore pipelines containing `GaussianNB`, `BernoulliNB`, and `MultinomialNB`, and would tune those algorithms' parameters over the ranges provided. This dictionary can be passed directly within the code to the `TPOTClassifier`/`TPOTRegressor` `config_dict` parameter, described above. For example:

```Python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict=classifier_config_dict)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
```

Command-line users must create a separate `.py` file with the custom configuration and provide the path to the file to the `tpot` call. For example, if the simple example configuration above is saved in `tpot_classifier_config.py`, that configuration could be used on the command line with the command:

```
tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py
```

For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for [classification](https://github.com/rhiever/tpot/blob/master/tpot/config_classifier.py) and [regression](https://github.com/rhiever/tpot/blob/master/tpot/config_regressor.py) in TPOT's source code.

Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, TPOT will simply not import or use XGBoost in the pipelines it explores.
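Because missing packages are skipped silently, one defensive pattern is to check for the optional dependency up front and trim the configuration yourself. A minimal sketch (the `xgboost.XGBClassifier` entry and its parameter ranges are illustrative assumptions, not TPOT defaults):

```Python
import importlib.util

# Hypothetical custom configuration mixing a core sklearn operator with an
# optional dependency (xgboost); the parameter ranges are illustrative.
custom_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'xgboost.XGBClassifier': {
        'n_estimators': [100],
        'max_depth': range(1, 11),
        'learning_rate': [1e-2, 1e-1, 0.5, 1.]
    }
}

# Keep only operators whose top-level package is importable, mirroring the
# silent skip TPOT performs internally, so the config documents exactly what
# can actually run on this machine.
custom_config = {
    op: params for op, params in custom_config.items()
    if importlib.util.find_spec(op.split('.')[0]) is not None
}
```

Trimming up front makes the behavior explicit in your own code instead of relying on TPOT's silent skip.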
