
LightGBM Model not being saved correctly #2517

Closed

Sanchita-P opened this issue Oct 18, 2019 · 5 comments

Comments

Sanchita-P commented Oct 18, 2019

I am saving my model by doing this:
best_gbm.save_model('best_gbm_raw_v2.1.txt', num_iteration=num_boost_round)
However, when I go through the txt file, I see that the num_iterations parameter is 100, which is the default value. This is incorrect, since I am explicitly passing num_boost_round while saving and it is not 100. Apart from this, all the other parameters are saved correctly. What could be causing this?
In case you'd like to see the training line:
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round) (best doesn't have num_iterations as a parameter)

Additionally, another quick question: initially, when I passed exactly the same parameters to the native LightGBM API and the sklearn API, I got exactly the same results. However, after making a few changes to the code, e.g. passing num_boost_round instead of a fixed number of iterations, the results from the LightGBM and sklearn APIs now differ significantly (feature importance and performance metrics). I can't figure out what's going wrong.

Here are the lines I am using to train both models:

##Native LightGBM
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

##Sklearn's LightGBM
best_sk = dict(best)
# the sklearn wrapper takes min_split_gain as a keyword, so drop the core alias from the dict
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round, learning_rate=0.05, min_split_gain=best['min_gain_to_split'])
sk_best_gbm.fit(x_train, y_train)

PS: No matter how many times I run each model individually, I get exactly the same results for it.
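
(Editor's sketch, not part of the original report: assuming the x_test split and the two fitted models from the full code further down in this thread, one way to quantify the discrepancy is to compare the predicted probabilities and the split-based feature importances directly.)

import numpy as np

# the native Booster returns P(y=1) for a binary objective;
# the sklearn wrapper exposes the same quantity via predict_proba
native_pred = best_gbm.predict(x_test)
sk_pred = sk_best_gbm.predict_proba(x_test)[:, 1]
print(np.abs(native_pred - sk_pred).max())

# both importances default to split counts, so they are directly comparable
print(best_gbm.feature_importance())
print(sk_best_gbm.feature_importances_)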

Sanchita-P added the bug label Oct 18, 2019
guolinke (Collaborator)

For the parameter logging error, refer to #2208; it is a known issue, but it doesn't affect usage.

For the second question, I don't understand it clearly. Do you mean that removing min_gain_to_split from the dict but passing it as a function argument gives a different result?

Sanchita-P (Author)

@guolinke Thank you for your response!
For the parameter logging issue, do you mean that the log will contain an incorrect number, but that it will not affect any results after loading the saved model?

When I pass exactly the same parameters to the native LightGBM API and the sklearn API, I get different results. However, no matter how many times I run each model individually, I get exactly the same results for it. Additionally, the native LightGBM API and the sklearn API initially gave the same results, but after minor changes in parameters and syntax they no longer match. Find the code below for reference:

import csv
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK

x_train, x_test, y_train, y_test = train_test_split(df_dummy[df_merge.columns], labels, test_size=0.25, random_state=42)

n_folds = 5

lgb_train = lgb.Dataset(x_train, y_train)

def objective(params, n_folds = n_folds):
    """Objective function for Gradient Boosting Machine Hyperparameter Tuning"""
    
    params['objective'] = 'binary'
    print(params)

    # hyperopt's quniform returns floats; LightGBM expects integers for these
    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])

    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])

    # Perform n_fold cross validation with hyperparameters

    # Use early stopping and evaluate based on ROC AUC
    cv_results = lgb.cv(params, lgb_train, nfold=n_folds, num_boost_round=10000, 
                        early_stopping_rounds=100, metrics='auc')
  
    # Extract the best score
    best_score = max(cv_results['auc-mean'])
    
    # Loss must be minimized
    loss = 1 - best_score
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)
    
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, num_iteration])
    of_connection.close()
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}

space = {
    'min_child_samples': hp.quniform('min_child_samples', 5, 100, 5), 
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'max_depth' : hp.quniform('max_depth', 3, 10, 1),
    'subsample' : hp.quniform('subsample', 0.6, 1, 0.05),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),  
    'subsample_freq': hp.quniform('subsample_freq',0,10,1),
    'min_gain_to_split': hp.quniform('min_gain_to_split', 0.01, 0.1, 0.01),

    'learning_rate' : hp.quniform('learning_rate', 0.05, 0.05, 0.05)                               
}

out_file = 'results/gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'estimators'])
of_connection.close()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=10)

#bayes_trials_results = sorted(trials.results, key = lambda x: x['loss'])
#bayes_trials_results[0]

results = pd.read_csv('results/gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
results.head()
best_bayes_estimators = int(results.loc[0, 'estimators'])

# cast hyperopt's float outputs back to integers and add the fixed parameters
best['max_depth'] = int(best['max_depth'])
best['num_leaves'] = int(best['num_leaves'])
best['min_child_samples'] = int(best['min_child_samples'])
num_boost_round=int(best_bayes_estimators * 1.1)
best['objective'] = 'binary'
best['boosting_type'] = 'gbdt'
best['subsample_freq'] = int(best['subsample_freq'])

np.random.RandomState(42)  # note: creates an unused RandomState object; it does not seed NumPy's global state or LightGBM
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

#TESTING ON SKLEARN 

best_sk = dict(best)
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round, min_split_gain=best['min_gain_to_split'])
np.random.RandomState(42)  # same note as above: this call has no effect on the fit below
sk_best_gbm.fit(x_train, y_train)

sk_best_gbm.get_params()

guolinke (Collaborator) commented Oct 18, 2019

@Sanchita-P yeah, those logged parameters are just a record of what was used in the experiment; they are not read back when the model is loaded.
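
(Editor's sketch, using the file name from the original post: a quick way to confirm this is to load the saved file and check how many trees the booster actually contains; that count should match num_boost_round even though the header still logs the default num_iterations=100.)

import lightgbm as lgb

# load the model that was saved with best_gbm.save_model(...)
booster = lgb.Booster(model_file='best_gbm_raw_v2.1.txt')

# the tree count reflects what was actually saved; it should equal
# num_boost_round even though the parameter section of the file
# still prints num_iterations=100
print(booster.num_trees())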

For the parameter consistency problem, you can try to use a new lgb_train before lgb.train, like:

lgb_train = lgb.Dataset(x_train, y_train)
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

lgb_train is lazily initialized and only constructed once, so in your script it gets constructed inside the cv calls. Some parameters used there (like min_child_samples) can affect how lgb_train is constructed, so it may end up initialized with different parameters than the ones you pass to the final lgb.train. (So it is better to use a fresh lgb_train in the cv part as well; see the sketch below.)
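
(Editor's sketch of that pattern; it reuses x_train, y_train, n_folds, STATUS_OK, best and num_boost_round from the code above and omits the CSV bookkeeping. The idea is to build a fresh Dataset inside the objective for each cv call and another one right before the final training, so neither call reuses a Dataset that was constructed with different parameters.)

def objective(params, n_folds=n_folds):
    # parameter casting omitted; same as in the original objective
    # fresh Dataset for this trial, so it is constructed with this trial's parameters
    cv_data = lgb.Dataset(x_train, y_train)
    cv_results = lgb.cv(params, cv_data, nfold=n_folds, num_boost_round=10000,
                        early_stopping_rounds=100, metrics='auc')
    best_score = max(cv_results['auc-mean'])
    return {'loss': 1 - best_score, 'status': STATUS_OK}

# ... run hyperopt as before, then rebuild the Dataset once more for the final fit
lgb_train = lgb.Dataset(x_train, y_train)
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)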

As for the randomness, you can set the seed parameter to control it.
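
(Editor's sketch of how such a seed could be passed to both APIs; seed and random_state are documented LightGBM parameter names, and the other variables come from the code above.)

# native API: pass the seed inside the params dict
best['seed'] = 42
best_gbm = lgb.train(params=best, train_set=lgb.Dataset(x_train, y_train),
                     num_boost_round=num_boost_round)

# sklearn API: the equivalent keyword argument is random_state
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round,
                                 min_split_gain=best['min_gain_to_split'],
                                 random_state=42)
sk_best_gbm.fit(x_train, y_train)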

Sanchita-P (Author)

@guolinke Using a new lgb_train before lgb.train worked! Thank you so much, I have spent the last 15 hours trying to figure it out and was unable to find the solution. You're a savior :)

guolinke (Collaborator)

I think this could be closed.
