This is a package for fast hyper-parameter tuning using noisy multi-fidelity tree-search for scikit-learn estimators (classifiers/regressors). Given ranges and types (categorical, integer, reals) for several hyper-parameters, this package is designed to search for a good configuration by treating the k-fold cross-validation errors under different settings and under different fidelities (levels of approximation based on amount of data used), as a multi-fidelity noisy black-box function. This work is based on the publications:
Please cite the above papers, when using this software in any publications/reports.
-
You will need a C and a Fortran compiler such as gnu95. Please install them by following the correct steps for your machine and OS.
-
You will need the following python packages: sklearn, numpy, brewer2mpl, scipy, matplotlib
- Go to the location of your choice and clone the repo.
git clone https://github.com/rajatsen91/MFTreeSearchCV.git
- Now the next steps are to setup the multi-fidelity enviroment.
- We will assume that the location of the project is
/home/MFTreeSearchCV/
- You first need to build the direct fortran library. For this
cd
into/home/MFTreeSearchCV/utils/direct_fortran
and runbash make_direct.sh
. You will need a fortran compiler such as gnu95. Once this is done, you can runsimple_direct_test.py
to make sure that it was installed correctly. - Edit the
set_up_gittins
file to change theGITTINS_PATH
to the local location of the repo.
GITTINS_PATH=/home/MFTreeSearchCV
- Run
source set_up_gittins
to set up all environment variables. - To test the installation, run
bash run_all_tests.sh
. Some of the tests are probabilistic and could fail at times. Some tests that require a scratch directory will fail, but nothing to worry about.
The usage of this repository is designed to be similar to parameter tuning functions in sklearn.model_selection, like gridsearchcv
. The main function is MFTreeSearchCV.MFTreeSearchCV
. The arguments and methods of this function are as follows,
"""Multi-Fidelity Tree Search over specified parameter ranges for an estimator.
Important members are fit, predict.
MFTreeSearchCV implements a "fit" and a "score" method.
It also implements "predict", "predict_proba" is they are present in the base-estimator.
The parameters of the estimator used to apply these methods are optimized
by cross-validated Tree Search over a parameter search space.
----------
estimator : estimator object.
This is assumed to implement the scikit-learn estimator interface.
Unlike grid search CV, estimator need not provide a ``score`` function.
Therefore ``scoring`` must be passed.
param_dict : Dictionary with parameters names (string) as keys and and the value is another dictionary. The value dictionary has
the keys 'range' that specifies the range of the hyper-parameter, 'type': 'int' or 'cat' or 'real' (integere, categorical or real),
and 'scale': 'linear' or 'log' specifying whether the search is done on a linear scale or a logarithmic scale. An example for param_dict
for scikit-learn SVC is as follows:
eg: param_dict = {'C' : {'range': [1e-2,1e2], 'type': 'real', 'scale': 'log'}, \
'kernel' : {'range': [ 'linear', 'poly', 'rbf', 'sigmoid'], 'type': 'cat'}, \
'degree' : {'range': [3,10], 'type': 'int', 'scale': 'linear'}}
scoring : string, callable, list/tuple, dict or None, default: None
A single string (see :ref:`scoring_parameter`). this must be specified as a string. See scikit-learn metrics
for more details.
fixed_params: dictionary of parameter values other than the once in param_dict, that should be held fixed at the supplied value.
For example, if fixed_params = {'nthread': 10} is passed with estimator as XGBoost, it means that all
XGBoost instances will be run with 10 parallel threads
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a `(Stratified)KFold`,
debug : Binary
Controls the verbosity: True means more messages, while False only prints critical messages
refit : True means the best parameters are fit into an estimator and trained, while False means the best_estimator is not refit
fidelity_range : range of fidelity to use. It is a tuple (a,b) which means lowest fidelity means a samples are used for training and
validation and b samples are used when fidelity is the highest. We recommend setting b to be the total number of training samples
available and a to bea reasonable value.
n_jobs : number of parallel runs for the CV. Note that njobs * (number of threads used in the estimator) must be less than the number of threads
allowed in your machine. default value is 1.
nu_max : automatically set, but can be give a default values in the range (0,2]
rho_max : rho_max in the paper. Default value is 0.95 and is recommended
sigma : sigma in the paper. Default value is 0.02, adjust according to the believed noise standard deviation in the system
C : default is 1.0, which is overwritten if Auto = True, which is the recommended setting
Auto : If True then the bias function parameter C is auto set. This is recommended.
tol : default values is 1e-3. All fidelities z_1, z_2 such that |z_1 - z_2| < tol are assumed to yield the same bias value
total_budget : total budget for the search in seconds. This includes the time for automatic parameter C selection and does not include refit time.
total_budget should ideally be more than 5X the unit_cost which is the time taken to run one experiment at the highest fidelity
unit_cost : time in seconds required to fit the base estimator at the highest fidelity. This should be estimated by the user and then supplied.
Attributes
----------
cv_results_ : dictionary showing the scores attained under a few parameters setting. Each
parameter setting is the best parameter obtained from a tree-search call.
best_estimator_ : estimator or dict
Estimator that was chosen by the search, i.e. estimator
which gave highest score (or smallest loss if specified)
on the left out data. Not available if ``refit=False``.
See ``refit`` parameter for more information on allowed values.
best_score_ : float
Mean cross-validated score of the best_estimator
best_params_ : dict
Parameter setting that gave the best results on the hold out data.
refit_time_ : float
Seconds used for refitting the best model on the whole dataset.
This is present only if ``refit`` is not False.
fit_time_ : float
Seconds taken to find the best parameters. Should be close to the budget given.
"""
A functional example is provided in the ipython notebook Illustrate.ipynb
.
estimator = LogisticRegression() #base estimator
param_dict = {'C':{'range':[1e-5,1e5],'scale':'log','type':'real'},\
'penalty':{'range':['l1','l2'],'scale':'linear','type':'cat'}} #parameter space
fidelity_range = [500,15076] # fidelity range, lowest fidelity uses 500 samples while the highest one uses
#the whole dataset
n_jobs = 3 # number of jobs
cv = 3 # cv level
fixed_params = {}
scoring = 'accuracy'
t1 = time.time()
estimator = estimator.fit(X_train,y_train)
t2 = time.time()
total_budget = 500 # total budget in seconds
print('Time without CV: ', t2 - t1)
model = MFTreeSearchCV(estimator=estimator,param_dict=param_dict,scoring=scoring,\
fidelity_range=fidelity_range,unit_cost=None,\
cv=cv, n_jobs = n_jobs,total_budget=total_budget,debug = True,fixed_params=fixed_params)
## running in debug mode will display certain outputs
m = model.fit(X_train,y_train) # actual tree search will be done here
y_pred = m.predict(X_test) # the returned model m will have the best parameters fitted as 'refit = True'
accuracy = accuracy_score(y_pred,y_test) # this is the best accuracy obtained
print(m.best_params_) # will print the best params
print(m.cv_results_) # will print cv scores for some other parameters as well, that were close
- Rajat Sen
This project is licensed under the MIT License - see the LICENSE.md file for details
- We acknowledge the support from @kirthevasank for providing the original multi-fidelity black-box function code base.