From 5513d52547aa6faaa367c2b1658d99f04f368727 Mon Sep 17 00:00:00 2001
From: chico
Date: Mon, 8 Mar 2021 20:31:51 +0100
Subject: [PATCH 1/5] First push of the developer documentation

---
 docs/conf.py   |  1 +
 docs/dev.rst   | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 docs/index.rst |  1 +
 3 files changed, 50 insertions(+)
 create mode 100644 docs/dev.rst

diff --git a/docs/conf.py b/docs/conf.py
index 1cbe6fe56..5af74617d 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -181,6 +181,7 @@
         ('Manual', 'manual'),
         ('Examples', 'examples/index'),
         ('API', 'api'),
+        ('Dev', 'dev'),
         ('Extending', 'extending'),
     ],

diff --git a/docs/dev.rst b/docs/dev.rst
new file mode 100644
index 000000000..25145e513
--- /dev/null
+++ b/docs/dev.rst
@@ -0,0 +1,48 @@
+:orphan:
+
+.. _dev:
+
+=======================
+Developer Documentation
+=======================
+
+This documentation summarizes how the AutoPyTorch code works, and it is meant to guide developers
+on how to best contribute to it.
+
+AutoPyTorch relies on the `SMAC `_ library to build individual models,
+which are later ensembled together using ensemble selection by `Caruana et al. (2004) `_.
+Therefore, there are two main parts of the code: `AutoMLSMBO`, which is our interface to the SMAC package, and
+`EnsembleBuilder`, which opportunistically builds an ensemble of the individual algorithms found by SMAC, at fixed intervals.
+The following sections provide details regarding these two main blocks of code.
+
+Building Individual Models
+==========================
+
+AutoPyTorch relies on the Scikit-Learn `Pipeline `_ to build an individual algorithm.
+In other words, each of the individual models fitted by SMAC is a Scikit-Learn pipeline and complies with the Scikit-Learn framework. For example, when a pipeline is fitted,
+we use pickle to save it to disk as stated `here `_. SMAC runs an optimization loop that proposes new
+configurations based on Bayesian optimization, which comply with the package `ConfigSpace `_. These configurations are
+translated to a pipeline configuration, fitted and saved to disk using the function evaluator `ExecuteTaFuncWithQueue`. The later is basically a worker that
+reads a dataset from disk, fits a pipeline, and collects the performance result which is communicated back to the main process via a Queue. This worker manages
+resources using `Pynisher `_, and it usually does so by creating a new process.
+
+Regarding multiprocessing, AutoPyTorch and SMAC work with `Dask.distributed `_. We only submit jobs to Dask up to the number of
+workers, and wait for a worker to be available before continuing.
+
+At the end of a SMAC run, the results will be available in the `temporary_directory` provided to the API, in particular inside the `/smac3-output/run_/` directory. One can debug
+the performance of the individual models using the file `runhistory.json` located in this area. Every individual model will be stored in `/.autoPyTorch/runs`.
+In this later directory we store the fitted model (during cross-validation we store a single Voting Classifier/Regressor, which is the soft voting outcome of k-Fold cross-validation), the Out-Of-Fold
+predictions that are used to build an ensemble, and also the test predictions of the model in question.
+
+Building the ensemble model
+===========================
+
+At every SMAC iteration, we submit a callback to create an ensemble in case new models have been written to disk. If no new models are available, no ensemble selection
+is triggered. We use the Out-Of-Fold predictions to build an ensemble via `EnsembleSelection`, following the greedy idea sketched below.
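+The snippet below is only an illustrative sketch of that greedy selection idea from Caruana et al. (2004), written in
+standalone NumPy; it is not the actual `EnsembleBuilder` code, which additionally handles weighting, persistence, and
+ensemble size control:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def greedy_ensemble_selection(oof_predictions, y_true, loss, size):
+        """Greedily add models (with replacement) to minimize ensemble loss."""
+        selected = []
+        current = np.zeros_like(oof_predictions[0])
+        for _ in range(size):
+            # Score the ensemble that results from adding each candidate model.
+            scores = [loss((current + p) / (len(selected) + 1), y_true)
+                      for p in oof_predictions]
+            best = int(np.argmin(scores))
+            selected.append(best)
+            current = current + oof_predictions[best]
+        # The weight of a model is how often it was selected.
+        return np.bincount(selected, minlength=len(oof_predictions)) / size
+
+Here `oof_predictions` would be one array of out-of-fold predictions per model found by SMAC, and `loss` any metric to
+be minimized; the function name and signature are purely illustrative.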
+This process is also submitted to Dask. Every new ensemble that is fitted is also written to disk; this object is mainly
+a container that specifies the weights with which to join the individual model predictions.
+
+The AutoML Part
+===============
+
+The ensemble builder and the individual model construction are both regulated by the `BaseTask`. This entity fundamentally dispatches the two aforementioned tasks, and waits until
+the time resource is exhausted.

diff --git a/docs/index.rst b/docs/index.rst
index 03022e944..d4889766d 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Manual
 * :ref:`installation`
 * :ref:`manual`
 * :ref:`api`
+* :ref:`dev`
 * :ref:`extending`

From a82300c2752b8ff8c03b82f1d8d4cb5bc42098e2 Mon Sep 17 00:00:00 2001
From: chico
Date: Wed, 24 Mar 2021 12:25:41 +0100
Subject: [PATCH 2/5] Feedback from Ravin

---
 docs/dev.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev.rst b/docs/dev.rst
index 25145e513..4fa96408b 100644
--- a/docs/dev.rst
+++ b/docs/dev.rst
@@ -22,7 +22,7 @@
 we use pickle to save it to disk as stated `here `_. SMAC runs an optimization loop that proposes new
 configurations based on Bayesian optimization, which comply with the package `ConfigSpace `_. These configurations are
-translated to a pipeline configuration, fitted and saved to disk using the function evaluator `ExecuteTaFuncWithQueue`. The later is basically a worker that
+translated to a pipeline configuration, fitted and saved to disk using the function evaluator `ExecuteTaFuncWithQueue`. The latter is basically a worker that
 reads a dataset from disk, fits a pipeline, and collects the performance result which is communicated back to the main process via a Queue. This worker manages
 resources using `Pynisher `_, and it usually does so by creating a new process.

From 9851b1bf44a1d7bd1ca270c419e3e4420e6c8914 Mon Sep 17 00:00:00 2001
From: chico
Date: Wed, 7 Apr 2021 11:05:28 +0200
Subject: [PATCH 3/5] Document scikit-learn develop guide

---
 docs/dev.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/dev.rst b/docs/dev.rst
index 4fa96408b..5bf2aee43 100644
--- a/docs/dev.rst
+++ b/docs/dev.rst
@@ -26,6 +26,8 @@
 reads a dataset from disk, fits a pipeline, and collects the performance result which is communicated back to the main process via a Queue. This worker manages
 resources using `Pynisher `_, and it usually does so by creating a new process.

+The Scikit-learn pipeline inherits from the `BaseEstimator `_, which implies that we have to honor the `Scikit-Learn development guidelines `_. Of particular interest is that any estimator must define as attributes the arguments that its constructor receives (see `get_params and set_params` in the above documentation).
+
 Regarding multiprocessing, AutoPyTorch and SMAC work with `Dask.distributed `_. We only submit jobs to Dask up to the number of
 workers, and wait for a worker to be available before continuing.
@@ -46,3 +48,6 @@
 The ensemble builder and the individual model construction are both regulated by the `BaseTask`. This entity fundamentally dispatches the two aforementioned tasks, and waits until
 the time resource is exhausted.
+
+We also rely on the `ConfigSpace `_ package to build a configuration space and sample configurations from it. A configuration in this context determines the content
+of a pipeline (for example, that the final estimator will be an MLP, or that PCA will be used for preprocessing), as sketched below.
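+For illustration, a toy configuration space in that spirit could be built as follows (the hyperparameter names here are
+made up for the example, not the ones AutoPyTorch actually registers):
+
+.. code-block:: python
+
+    from ConfigSpace.conditions import EqualsCondition
+    from ConfigSpace.configuration_space import ConfigurationSpace
+    from ConfigSpace.hyperparameters import (
+        CategoricalHyperparameter,
+        UniformIntegerHyperparameter,
+    )
+
+    cs = ConfigurationSpace()
+    # Choice of the final estimator, plus a hyperparameter that is
+    # only active when the MLP is selected.
+    estimator = CategoricalHyperparameter('estimator', ['mlp', 'catboost'])
+    num_layers = UniformIntegerHyperparameter('mlp:num_layers', 1, 10)
+    cs.add_hyperparameters([estimator, num_layers])
+    cs.add_condition(EqualsCondition(num_layers, estimator, 'mlp'))
+
+    # SMAC proposes such configurations via Bayesian optimization.
+    config = cs.sample_configuration()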
+The set of valid configurations is determined by the configuration space, which is built using dataset characteristics such as the type
+of features (categorical, numerical) or the target type (classification, regression).

From e6e9a5c2f455f62c187d0cdf6700ba5fed38ad5a Mon Sep 17 00:00:00 2001
From: chico
Date: Wed, 7 Apr 2021 16:58:13 +0200
Subject: [PATCH 4/5] Feedback from Ravin

---
 docs/dev.rst | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/docs/dev.rst b/docs/dev.rst
index 5bf2aee43..80a66878a 100644
--- a/docs/dev.rst
+++ b/docs/dev.rst
@@ -6,34 +6,35 @@
 Developer Documentation
 =======================

-This documentation summarizes how the AutoPyTorch code works, and it is meant to guide developers
-on how to best contribute to it.
+This documentation summarizes how the AutoPyTorch code works, and is meant as a guide for the developers to help contribute to it. .

 AutoPyTorch relies on the `SMAC `_ library to build individual models,
 which are later ensembled together using ensemble selection by `Caruana et al. (2004) `_.
-Therefore, there are two main parts of the code: `AutoMLSMBO`, which is our interface to the SMAC package, and
+Therefore, there are two main parts of the code: `AutoMLSMBO`, which acts as an interface to SMAC, and
 `EnsembleBuilder`, which opportunistically builds an ensemble of the individual algorithms found by SMAC, at fixed intervals.
 The following sections provide details regarding these two main blocks of code.

 Building Individual Models
 ==========================

 AutoPyTorch relies on the Scikit-Learn `Pipeline `_ to build an individual algorithm.
 In other words, each of the individual models fitted by SMAC is a Scikit-Learn pipeline and complies with the Scikit-Learn framework.

+A pipeline can consist of various preprocessing steps including imputation, encoding, scaling, and feature preprocessing, as well as algorithm setup and training.
+Regarding the training, AutoPyTorch can fit three types of pipelines: a dummy pipeline, traditional classification pipelines, and PyTorch neural networks.
+The dummy pipeline builds on top of `sklearn.dummy` to construct an estimator that predicts using simple rules, as in the sketch below.
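+For instance, a baseline similar in spirit to what the dummy pipeline produces could be built directly with
+scikit-learn (an illustrative sketch, not the actual AutoPyTorch wrapper):
+
+.. code-block:: python
+
+    import numpy as np
+    from sklearn.dummy import DummyClassifier
+
+    X = np.random.rand(100, 4)          # toy features
+    y = np.random.randint(0, 2, 100)    # toy binary labels
+
+    # Always predicts the most frequent class seen during training.
+    baseline = DummyClassifier(strategy='most_frequent')
+    baseline.fit(X, y)
+    print(baseline.score(X, y))  # any real pipeline should beat this score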
+This prediction is used as a baseline to define the worst-performing model that can be fit. Additionally, AutoPyTorch fits traditional machine learning models
+(including LightGBM, CatBoost, RandomForest, ExtraTrees, K-Nearest-Neighbors, and Support Vector Machines), which are critical for small-sized datasets.
+The final type of pipeline corresponds to Neural Architecture Search of backbones (feature extraction) and network heads (for the final prediction).
+A pipeline might also contain additional training components like learning rate schedulers, optimizers, and data loaders required to perform the neural architecture search.
+
+In the case of tabular classification/regression, the training data is preprocessed using `sklearn.compose.ColumnTransformer` on a per-column basis.
+The data preprocessing is dynamically created depending on the dataset properties. For example, on a dataset that only contains float-type features, no one-hot encoding is needed.
+Additionally, we wrap the ColumnTransformer via the `TabularColumnTransformer` class to support torchvision transformations and to handle column reordering
+(categorical columns are shifted to the left when one uses a ColumnTransformer).
+
+When a pipeline is fitted, we use pickle to save it to disk as stated `here `_. SMAC runs an optimization loop that proposes new configurations
+based on Bayesian optimization, which comply with the `ConfigSpace `_ package. These configurations are then translated to AutoPyTorch pipelines, fitted,
+and finally saved to disk using the function evaluator `ExecuteTaFuncWithQueue`. The latter is basically a worker that reads a dataset from disk, fits a pipeline,
+and collects the performance result, which is communicated back to the main process via a Queue. This worker manages resources using `Pynisher `_,
+and it usually does so by creating a new process with restricted memory (the `memory_limit` API argument) and time constraints (the `func_eval_time_limit_secs` API argument).
+
+The Scikit-learn pipeline inherits from the `BaseEstimator `_, which implies that we have to honor the `Scikit-Learn development guidelines `_.
+In particular, the arguments to the class constructor of any estimator must be defined as attributes of the class (see `get_params and set_params` in the above documentation).
+
+To speed up the search, AutoPyTorch and SMAC use the `Dask.distributed `_ multiprocessing scheme. We only submit jobs to the Dask.distributed.Client up to the number of
+workers, and wait for a worker to become available before searching for more pipelines.
+
+At the end of SMAC, the results will be available in the `temporary_directory` provided to the API, in particular inside the `/smac3-output/run_/` directory. One can debug
+the performance of the individual models using the file `runhistory.json` located in this directory; a sketch of how to inspect it follows.
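+The sketch below assumes the usual SMAC3 `runhistory.json` layout, with top-level `data` and `configs` entries
+(key names can differ between SMAC versions, and the run directory name depends on the seed; the paths are hypothetical):
+
+.. code-block:: python
+
+    import json
+
+    # Substitute the temporary_directory that was passed to the API.
+    with open('tmp_dir/smac3-output/run_1/runhistory.json') as f:
+        runhistory = json.load(f)
+
+    # Each entry pairs a run key (config id, instance, seed, budget) with
+    # the measured cost, runtime, and status of that evaluation.
+    for run_key, run_value in runhistory['data']:
+        config_id, cost = run_key[0], run_value[0]
+        print(config_id, cost, runhistory['configs'][str(config_id)])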
 Every individual model will be stored in `/.autoPyTorch/runs`.
-In this later directory we store the fitted model (during cross-validation we store a single Voting Classifier/Regressor, which is the soft voting outcome of k-Fold cross-validation), the Out-Of-Fold
+In this `runs` directory we store the fitted model (during cross-validation we store a single Voting Classifier/Regressor, which is the soft voting outcome of k-Fold cross-validation), the Out-Of-Fold
 predictions that are used to build an ensemble, and also the test predictions of the model in question.

 Building the ensemble model
 ===========================

From 6c33465f8573f722744510de75a034b4afaa4ef5 Mon Sep 17 00:00:00 2001
From: chico
Date: Wed, 7 Apr 2021 18:27:10 +0200
Subject: [PATCH 5/5] Delete extra point

---
 docs/dev.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev.rst b/docs/dev.rst
index 80a66878a..b8bdab0c3 100644
--- a/docs/dev.rst
+++ b/docs/dev.rst
@@ -6,7 +6,7 @@
 Developer Documentation
 =======================

-This documentation summarizes how the AutoPyTorch code works, and is meant as a guide for the developers to help contribute to it. .
+This documentation summarizes how the AutoPyTorch code works, and is meant as a guide for developers to help them contribute to it.

 AutoPyTorch relies on the `SMAC `_ library to build individual models,
 which are later ensembled together using ensemble selection by `Caruana et al. (2004) `_.