
Error when trying to use Gaussian process regressor #1218

Open
wayneking517 opened this issue Jul 23, 2021 · 8 comments

wayneking517 commented Jul 23, 2021

I hope that this question is allowed on this site.

I am trying to use TPOT to optimize a Gaussian process regressor. As such, I need a custom config:

```
tpot_config = {
    'kernel': [1.0*RBF(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0)),
               1.0*RationalQuadratic(length_scale=0.5, alpha=0.1),
               1.0*ExpSineSquared(length_scale=0.5, periodicity=3.0,
                                  length_scale_bounds=(1e-05, 100000.0),
                                  periodicity_bounds=(1.0, 10.0)),
               ConstantKernel(0.1, (0.01, 10.0))*(DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
               1.0**2*Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)],
    'alpha': [5e-9, 1e-3, 1e-2, 1e-1, 1., 10., 100.],
    'normalize_y': [True, False],
    'optimizer': ['fmin_l_bfgs_b']
}
tpot = TPOTRegressor(generations=5,
                     population_size=50,
                     verbosity=2,
                     cv=5,
                     config_dict=tpot_config,
                     random_state=42)
```

When I launch the fit with `tpot.fit(X_train, y_train)`, I get:

```
Warning: alpha is not available and will not be used by TPOT.
Warning: kernel is not available and will not be used by TPOT.
Warning: normalize_y is not available and will not be used by TPOT.
Warning: optimizer is not available and will not be used by TPOT.
/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/sklearn/utils/validation.py:63: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  return f(*args, **kwargs)

Generation 1 - Current best internal CV score: -inf
Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "", line 1, in
  File "/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 863, in fit
    raise e
  File "/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 854, in fit
    self._update_top_pipeline()
  File "/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 921, in _update_top_pipeline
    sklearn_pipeline = self._toolbox.compile(expr=pipeline)
  File "/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 1412, in _compile_to_sklearn
    expr_to_tree(expr, self._pset), self.operators
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/export_utils.py", line 365, in generate_pipeline_code
    steps = _process_operator(pipeline_tree, operators)
  File "/Users/lk/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/export_utils.py", line 401, in _process_operator
    op_name = operator[0]
IndexError: list index out of range
```


rachitk commented Jul 23, 2021

Hi @wayneking517. No worries - your question is absolutely fine to ask here. In the future, when copying Python code, please use GitHub's code blocks by wrapping your code in three backticks, as follows:

```
'sklearn.naive_bayes.BernoulliNB': {
'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
'fit_prior': [True, False]
},
```

This would be formatted as the following:

```
'sklearn.naive_bayes.BernoulliNB': {
    'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
    'fit_prior': [True, False]
},
```

This avoids issues where GitHub interprets parts of your code as Markdown (italics and other formatting) and makes it hard to read.

From first glance, the primary issue is that you aren't defining your TPOT configuration in the appropriate manner: the way you've written it creates a flat dictionary with the keys 'kernel', 'alpha', 'normalize_y', and 'optimizer' and their corresponding values.

However, TPOT uses a nested dictionary to define operators and their parameters. Each key of the dictionary should be the name of an operator for TPOT to import, and the values associated with that key should be a dictionary like the one you have passed. Your dictionary is just the set of parameters but without the operator as a key - this makes TPOT think each of your parameters is an operator and it attempts to import them. As they are not valid operators, TPOT fails to import them and then errors due to having an empty operator set.

You can see an example of a full TPOT configuration dictionary here for information on how the formatting should work: https://github.com/EpistasisLab/tpot/blob/master/tpot/config/regressor.py

Assuming you want to use GaussianProcessRegressor (https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html) as the only operator in TPOT, you can define a configuration dictionary along the following lines (though please double-check the parameters, as I had to infer some parts of your code due to GitHub mangling the formatting):

```
tpot_config = {
    'sklearn.gaussian_process.GaussianProcessRegressor': {
        'kernel': [1.0*RBF(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0)),
                   1.0*RationalQuadratic(length_scale=0.5, alpha=0.1),
                   1.0*ExpSineSquared(length_scale=0.5, periodicity=3.0, length_scale_bounds=(1e-05, 100000.0), periodicity_bounds=(1.0, 10.0)),
                   ConstantKernel(0.1, (0.01, 10.0))*(DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
                   1.0**2*Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)],
        'alpha': [5e-9, 1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'normalize_y': [True, False],
        'optimizer': ['fmin_l_bfgs_b']
    },
}
```

If you want to add additional operators, you need to continue building the dictionary by adding another element with the name of that operator as the key having a value of a dictionary of parameters with their associated values.
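For instance, a hypothetical two-operator configuration might look like the following (the operator paths are real scikit-learn import paths, but the hyperparameter values here are illustrative, not a tested recommendation):

```python
# Hypothetical two-operator TPOT configuration: each top-level key is the
# import path of an operator, and each value is its hyperparameter grid.
tpot_config = {
    'sklearn.gaussian_process.GaussianProcessRegressor': {
        'alpha': [1e-2, 1e-1, 1.0],
        'normalize_y': [True, False],
    },
    'sklearn.linear_model.Ridge': {
        'alpha': [0.1, 1.0, 10.0],
    },
}
print(len(tpot_config))  # 2 operators in the search space
```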

Let me know if you have any additional questions or concerns.


wayneking517 commented Jul 24, 2021

Thank you for your answer. I found the tpot_config file example at https://www.gitmemory.com/issue/EpistasisLab/tpot/1186/801378125.

I have made the recommended changes.

```
tpot_config = {
    'sklearn.gaussian_process.GaussianProcessRegressor': {
        'kernel': [1.0*RBF(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0)),
                   1.0*RationalQuadratic(length_scale=0.5, alpha=0.1),
                   1.0*ExpSineSquared(length_scale=0.5, periodicity=3.0, length_scale_bounds=(1e-05, 100000.0), periodicity_bounds=(1.0, 10.0)),
                   ConstantKernel(0.1, (0.01, 10.0))*(DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
                   1.0**2*Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)],
        'alpha': [5e-9, 1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'normalize_y': [True, False],
        'optimizer': ['fmin_l_bfgs_b']
    },
}
```

I now get `NameError: name 'RBF' is not defined`. If I remove RBF, I get `'DotProduct' is not defined`. If I remove DotProduct, I get `TypeError: Cannot clone object '0.099856' (type <class 'float'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method`.

Also, if I use tpot_config, do I have to include my own preprocessors, or will the built-in ones be used?

Once again, thank you for tolerating these rookie questions.


rachitk commented Jul 27, 2021

Sorry for the delay in responding. You are likely getting these errors because you have not imported the functions required to call RBF, DotProduct, etc. You need to import them from sklearn (include `from sklearn.gaussian_process.kernels import WhiteKernel, Matern, RBF, DotProduct, RationalQuadratic, ExpSineSquared` in your imports) or from wherever else you would like to import these functions.

We did not develop or test this config file found in #1186 and #1191 - it seems that this was a custom one made by another user that has not been merged into TPOT because it has many build issues and errors in its current state. You will probably need to tweak the configuration appropriately to get things to work. Let us know if importing the needed functions resolves the issues or provides a different error message.

If you define your own TPOT configuration, it will only use the operators from the configuration you pass to TPOT. The configuration above will make the GaussianProcessRegressor the only operator that TPOT considers.

If you wish to include other operators in TPOT's search space, you will need to define them and their parameters in the same manner as you did for the GaussianProcessRegressor (which you can do by copying entries from a default configuration or by merging your dictionary with one of the default configurations).

@wayneking517

Here is my code:

```
from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel, RationalQuadratic, \
    Exponentiation, RBF, ExpSineSquared, DotProduct
from sklearn.gaussian_process import GaussianProcessRegressor


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.22, random_state=2)


tpot_config = {
    'sklearn.gaussian_process.GaussianProcessRegressor': {
        'kernel': [1.0*RBF(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0)),
                   1.0*RationalQuadratic(length_scale=0.5, alpha=0.1),
                   1.0*ExpSineSquared(length_scale=0.5, periodicity=3.0, length_scale_bounds=(1e-05, 100000.0), periodicity_bounds=(1.0, 10.0)),
                   ConstantKernel(0.1, (0.01, 10.0))*(DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
                   1.0**2*Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)],
        'alpha': [5e-9, 1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'normalize_y': [True, False],
        'optimizer': ['fmin_l_bfgs_b']
    },
}
```

Here is the error:

```
Generation 1 - Current best internal CV score: -inf
Traceback (most recent call last):
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 828, in fit
    log_file=self.log_file_,
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/gp_deap.py", line 281, in eaMuPlusLambda
    per_generation_function(gen)
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 1176, in _check_periodic_pipeline
    self._update_top_pipeline()
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 921, in _update_top_pipeline
    sklearn_pipeline = self._toolbox.compile(expr=pipeline)
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 1414, in _compile_to_sklearn
    sklearn_pipeline = eval(sklearn_pipeline_str, self.operators_context)
  File "<string>", line 2, in <module>
NameError: name 'RBF' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lindaking/PycharmProjects/ANN/Hahne_TPot.py", line 57, in <module>
    tpot.fit(X_train, y_train)
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 863, in fit
    raise e
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 854, in fit
    self._update_top_pipeline()
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 921, in _update_top_pipeline
    sklearn_pipeline = self._toolbox.compile(expr=pipeline)
  File "/Users/lindaking/PycharmProjects/ANN/venv/lib/python3.7/site-packages/tpot/base.py", line 1414, in _compile_to_sklearn
    sklearn_pipeline = eval(sklearn_pipeline_str, self.operators_context)
  File "<string>", line 2, in <module>
NameError: name 'RBF' is not defined

Process finished with exit code 1
```


rachitk commented Jul 28, 2021

@wayneking517 I see the issue - I had assumed RBF and the other kernel functions returned simple arrays/lists, which TPOT can properly handle when evaluating pipelines. However, they return kernel type objects from sklearn, and passing in arbitrary objects as hyperparameter options is not currently supported using TPOT's current methods.

There is a workaround for this, which involves appending the contexts needed for the kernels to be evaluated properly to TPOT's operator contexts, which I've provided a demo of below:

```
from tpot import TPOTRegressor

import sklearn
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

from sklearn.gaussian_process.kernels import Matern, WhiteKernel, ConstantKernel, RationalQuadratic, \
    Exponentiation, RBF, ExpSineSquared, DotProduct
from sklearn.gaussian_process import GaussianProcessRegressor


X, y = make_regression()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.22, random_state=2)

gaussian_config = {
    'sklearn.gaussian_process.GaussianProcessRegressor': {
        'kernel': [1.0*RBF(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0)),
                   1.0*RationalQuadratic(length_scale=0.5, alpha=0.1),
                   1.0*ExpSineSquared(length_scale=0.5, periodicity=3.0, length_scale_bounds=(1e-05, 100000.0), periodicity_bounds=(1.0, 10.0)),
                   ConstantKernel(0.1, (0.01, 10.0))*(DotProduct(sigma_0=1.0, sigma_0_bounds=(0.1, 10.0)) ** 2),
                   1.0**2*Matern(length_scale=0.5, length_scale_bounds=(1e-05, 100000.0), nu=0.5)],
        'alpha': [5e-9, 1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'normalize_y': [True, False],
        'optimizer': ['fmin_l_bfgs_b']
    },
}

# Map kernel names to their classes so TPOT can resolve them
# when evaluating generated pipeline strings.
kernel_dict = {
    "RBF": RBF,
    "RationalQuadratic": RationalQuadratic,
    "ExpSineSquared": ExpSineSquared,
    "ConstantKernel": ConstantKernel,
    "DotProduct": DotProduct,
    "Matern": Matern,
}

tpot_obj = TPOTRegressor(generations=5,
                         population_size=50,
                         verbosity=2,
                         cv=5,
                         config_dict=gaussian_config,
                         random_state=42)

tpot_obj._fit_init()
tpot_obj.operators_context.update(kernel_dict)
tpot_obj.warm_start = True

tpot_obj.fit(X_train, y_train)
```

Modify the line defining X and y to match your own data.

If you are interested in the technical details of what we are doing in the demo above: essentially, we are creating a context dictionary for the kernels (RBF, RationalQuadratic, etc.) so that TPOT knows where to look for them if it finds them in a pipeline string. We then instantiate TPOT and call its internal _fit_init function to set the existing operator contexts (with some special functions that TPOT uses to construct pipelines) and then append the kernel context dictionary we made earlier to that operator context. We set "warm_start" to true simply to indicate that we have already called _fit_init() and to prevent TPOT from overwriting the context dictionary we have set, and then we run TPOT.

This demo will allow TPOT to construct a pipeline, though the only operator in its search space with the config above would be the GaussianProcessRegressor. To add more operators, you would need to define more operators in the dictionary you pass in or append this to another config dictionary, which you could do by importing an existing configuration dictionary, appending it to your custom configuration for the Gaussian Process Regressor, and then passing the combined configuration dictionary to TPOT:

```
from tpot.config.regressor import regressor_config_dict

gaussian_config.update(regressor_config_dict)
```

In the future, we will look into adding a parameter for passing additional contexts to TPOT, to support applications like this and further improve TPOT. Thank you for raising this issue and bringing it to our attention!

Hopefully this helps with what you are aiming to do. Please let me know if you have any additional questions!

@wayneking517

Thank you. To get this to work, I had to replace each instance of "guassian_config" with "c" (or choose your own variable).

```
# Average CV score on the training set was: -0.4411353932258676
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=GaussianProcessRegressor(alpha=1.0, kernel=1**2 * Matern(length_scale=0.5, nu=0.5), normalize_y=False, optimizer="fmin_l_bfgs_b")),
    StackingEstimator(estimator=GaussianProcessRegressor(alpha=100.0, kernel=1**2 * ExpSineSquared(length_scale=0.5, periodicity=3), normalize_y=False, optimizer="fmin_l_bfgs_b")),
    GaussianProcessRegressor(alpha=5e-09, kernel=0.316**2 * DotProduct(sigma_0=1) ** 2, normalize_y=True, optimizer="fmin_l_bfgs_b")
)
```

What is StackingEstimator?


rachitk commented Jul 28, 2021

Apologies - I misspelled the first instance of "gaussian_config" as "guassian_config" in the demo above. I've corrected the misspelling, which should remove the need to rename the variables (since they now match).

To answer your question: StackingEstimator is a built-in TPOT operator that will take an estimator (a classifier or regressor; in this case, your GaussianProcessRegressor) and append the results of that regressor to the feature set as a synthetic feature.

In essence, the pipeline here is generating a regression output from the first regressor and appending that as a synthetic feature to the features, then passing the combination to the second regressor which generates its own regression output that is also appended to the features, then finally the total set of features (consisting of your input + 2 synthetic features from the prior regressors) is predicted on by the final regressor.
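The idea above can be sketched in a few lines. This is a conceptual illustration, not TPOT's actual StackingEstimator implementation: `SimpleStackingEstimator` is a hypothetical class that fits an inner estimator and appends its predictions to the feature matrix as a synthetic feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class SimpleStackingEstimator:
    """Conceptual sketch: fit an inner estimator and append its
    predictions to the feature matrix as one synthetic feature."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        preds = self.estimator.predict(X).reshape(-1, 1)
        return np.hstack([X, preds])  # original features + synthetic feature

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

stacker = SimpleStackingEstimator(LinearRegression()).fit(X, y)
X_aug = stacker.transform(X)
print(X_aug.shape)  # (4, 2): one input feature plus one synthetic feature
```

Chaining two such steps before a final regressor, as in your exported pipeline, would leave the final regressor predicting on the original features plus two synthetic ones.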

@wayneking517

I've tried to tune the GP model using TPOT.

The optimized model is:

```
exported_pipeline = GaussianProcessRegressor(alpha=0.01, kernel=1**2 * ExpSineSquared(length_scale=0.5, periodicity=3), normalize_y=False, optimizer="fmin_l_bfgs_b")
```

When I invoke the regressor in my code, I get:

```
17**2 * ExpSineSquared(length_scale=0.000764, periodicity=0.000302)
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/gaussian_process/_gpr.py:504: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)

I am wondering if you might know what TPOT does to avoid this error.
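As the warning itself suggests, one common remedy is to scale the inputs before the regressor so that lbfgs can optimize the kernel hyperparameters more reliably. A minimal sketch (illustrative only: it uses an RBF kernel and synthetic data, not the thread's exact kernel or dataset):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X, y = make_regression(n_samples=60, n_features=3, noise=1.0, random_state=0)

# StandardScaler standardizes the inputs; normalize_y=True additionally
# centers and scales the targets inside the regressor.
pipe = make_pipeline(
    StandardScaler(),
    GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=1.0),
                             alpha=0.01, normalize_y=True, random_state=0),
)
pipe.fit(X, y)
preds = pipe.predict(X[:5])
print(preds.shape)
```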
