Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion as other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
Gradient boosting decision trees (GBDT) have many popular implementations, such as LightGBM, XGBoost, and CatBoost. GBDT is a strong tool for traditional machine learning problems. Because it is a robust algorithm, it can be used in many domains. The better the hyper-parameters you choose for GBDT, the better the performance you can achieve.
NNI is a platform for tuning hyper-parameters: you can try its various built-in search algorithms and run multiple trials concurrently.
GBDT has many hyper-parameters, but which ones affect performance or speed? Based on practical experience, here are some suggestions (taking LightGBM as an example):
- For better accuracy
  - learning_rate. The range of learning_rate could be [0.001, 0.9].
  - num_leaves. num_leaves is related to max_depth, so you don't have to tune both of them.
  - bagging_freq. bagging_freq could be [1, 2, 4, 8, 10].
  - num_iterations. It may need to be larger if the model is underfitting.
- For faster training
  - bagging_fraction. The range of bagging_fraction could be [0.7, 1.0].
  - feature_fraction. The range of feature_fraction could be [0.6, 1.0].
  - max_bin.
- To avoid overfitting
  - min_data_in_leaf. This depends on your dataset.
  - min_sum_hessian_in_leaf. This depends on your dataset.
  - lambda_l1 and lambda_l2.
  - min_gain_to_split.
  - num_leaves.
Reference links: LightGBM and autoxgboost.
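The suggested ranges can also be written down programmatically. The following is a minimal sketch, assuming the `_type`/`_value` search-space format that NNI uses later in this document. The helper name `build_search_space` is hypothetical, and sampling learning_rate with `loguniform` is an assumption made here because the range spans several orders of magnitude; it is not a recommendation taken from the references.

```python
import json


def build_search_space():
    """Return an NNI-style search space built from the ranges suggested above."""
    return {
        # Assumption: sample learning_rate on a log scale over [0.001, 0.9].
        "learning_rate": {"_type": "loguniform", "_value": [0.001, 0.9]},
        # Discrete candidates for bagging_freq, as listed above.
        "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8, 10]},
        # Continuous ranges for the two sub-sampling fractions.
        "bagging_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
        "feature_fraction": {"_type": "uniform", "_value": [0.6, 1.0]},
    }


if __name__ == "__main__":
    # Print JSON text that could be saved as search_space.json.
    print(json.dumps(build_search_space(), indent=4))
```

Parameters marked above as dataset-dependent (for example min_data_in_leaf) are left out on purpose, since the text gives no range for them.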
Now we come back to our example "auto-gbdt", which runs on LightGBM and NNI. The data consists of training data and test data. Given the features and labels in the training data, we train a GBDT regression model and use it to predict.
pip install lightgbm
pip install pandas
You need to prepare basic training code like the following:
...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...
    return lgb_train, lgb_eval, X_test, y_test


def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)

    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)


if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()

    PARAMS = get_default_parameters()

    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
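The evaluation step takes the square root of the mean squared error, which gives the root mean squared error (RMSE). As a rough illustration of what that line computes, here is a standard-library sketch; the function name `rmse` and the sample values are placeholders chosen for this example and are not taken from the regression dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average them, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Illustrative values only: residuals are 3 and 4.
    print(rmse([3.0, 5.0], [0.0, 1.0]))
```

Lower RMSE means better predictions, which matters later when choosing the tuner's optimize_mode.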
If you would like to tune num_leaves, learning_rate, bagging_fraction, and bagging_freq, you could write a search_space.json as follows:
{
    "num_leaves":{"_type":"choice","_value":[31, 28, 24, 20]},
    "learning_rate":{"_type":"choice","_value":[0.01, 0.05, 0.1, 0.2]},
    "bagging_fraction":{"_type":"uniform","_value":[0.7, 1.0]},
    "bagging_freq":{"_type":"choice","_value":[1, 2, 4, 8, 10]}
}
For more supported variable types, refer to the NNI search space documentation.
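To make the meaning of "choice" and "uniform" concrete, the sketch below draws one random configuration from the search space shown above. It is only an illustration of the two variable types: it is not how NNI's tuners (such as TPE) select parameters, and the function name `sample_parameters` is hypothetical.

```python
import json
import random

# The same search space as search_space.json above, embedded as text.
SEARCH_SPACE_JSON = """
{
    "num_leaves": {"_type": "choice", "_value": [31, 28, 24, 20]},
    "learning_rate": {"_type": "choice", "_value": [0.01, 0.05, 0.1, 0.2]},
    "bagging_fraction": {"_type": "uniform", "_value": [0.7, 1.0]},
    "bagging_freq": {"_type": "choice", "_value": [1, 2, 4, 8, 10]}
}
"""


def sample_parameters(search_space, rng):
    """Draw one random configuration; handles only "choice" and "uniform"."""
    params = {}
    for name, spec in search_space.items():
        if spec["_type"] == "choice":
            # Pick one of the listed candidate values.
            params[name] = rng.choice(spec["_value"])
        elif spec["_type"] == "uniform":
            # Draw a float between the two bounds.
            low, high = spec["_value"]
            params[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unsupported _type: {spec['_type']}")
    return params


if __name__ == "__main__":
    space = json.loads(SEARCH_SPACE_JSON)
    print(sample_parameters(space, random.Random(0)))
```

Each trial receives one such dictionary, with keys matching the LightGBM parameter names so that it can be merged directly into the training parameters.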
Next, add the NNI SDK calls to your code (lines marked with + are new):
+import nni
...

def get_default_parameters():
    ...
    return params


def load_data(train_path='./data/regression.train', test_path='./data/regression.test'):
    '''
    Load or create dataset
    '''
    ...
    return lgb_train, lgb_eval, X_test, y_test


def run(lgb_train, lgb_eval, params, X_test, y_test):
    # train
    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=20,
                    valid_sets=lgb_eval,
                    early_stopping_rounds=5)

    # predict
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

    # eval
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    print('The rmse of prediction is:', rmse)
+   nni.report_final_result(rmse)


if __name__ == '__main__':
    lgb_train, lgb_eval, X_test, y_test = load_data()
+   RECEIVED_PARAMS = nni.get_next_parameter()
    PARAMS = get_default_parameters()
+   PARAMS.update(RECEIVED_PARAMS)

    # train
    run(lgb_train, lgb_eval, PARAMS, X_test, y_test)
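The key step in the modified script is merging the tuner's values into the default parameters with `PARAMS.update(RECEIVED_PARAMS)`. The sketch below shows that merge in isolation. The default values are hypothetical, since the body of get_default_parameters() is elided in this document; the received values are made-up examples of what a trial might get for the search space above; and treating a missing result as "no overrides" is an assumption added here, not documented NNI behavior.

```python
def merge_parameters(defaults, received):
    """Return a copy of defaults overridden by tuner-supplied values."""
    merged = dict(defaults)  # copy so the defaults stay untouched
    # Assumption: a falsy result (None or {}) means "no overrides".
    merged.update(received or {})
    return merged


if __name__ == "__main__":
    # Hypothetical defaults; the real get_default_parameters() is not shown.
    defaults = {"num_leaves": 31, "learning_rate": 0.05, "bagging_fraction": 0.8}
    # Hypothetical values standing in for the result of nni.get_next_parameter().
    received = {"num_leaves": 24, "learning_rate": 0.1}
    print(merge_parameters(defaults, received))
```

Keys that the tuner does not supply keep their default values, so the search space only needs to list the parameters you actually want to tune.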
In the config file, you can configure settings including:
- Experiment settings: trialConcurrency, maxExecDuration, maxTrialNum, trial gpuNum, etc.
- Platform settings: trainingServicePlatform, etc.
- Path settings: searchSpacePath, trial codeDir, etc.
- Algorithm settings: the tuner algorithm, tuner optimize_mode, etc.
An example config.yml is as follows:
authorName: default
experimentName: example_auto-gbdt
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: minimize
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
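Because the trial reports RMSE, where lower is better, optimize_mode is set to minimize. The sketch below illustrates what that setting means when comparing trial results. It is not NNI's tuner logic; the function name `best_trial`, the record layout, and the metric values are hypothetical placeholders, not results from a real experiment.

```python
def best_trial(results, optimize_mode="minimize"):
    """Pick the best record under the given optimize_mode."""
    if optimize_mode not in ("minimize", "maximize"):
        raise ValueError("optimize_mode must be 'minimize' or 'maximize'")
    if not results:
        raise ValueError("results must not be empty")
    # minimize keeps the smallest metric, maximize the largest.
    pick = min if optimize_mode == "minimize" else max
    return pick(results, key=lambda record: record["metric"])


if __name__ == "__main__":
    # Placeholder records: made-up metrics, not real experiment output.
    results = [
        {"params": {"num_leaves": 31}, "metric": 0.52},
        {"params": {"num_leaves": 24}, "metric": 0.47},
        {"params": {"num_leaves": 20}, "metric": 0.49},
    ]
    print(best_trial(results, "minimize"))
```

If the trial reported a score where higher is better, such as accuracy, optimize_mode would need to be maximize instead.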
Run this experiment with the following command:
nnictl create --config ./config.yml