
LightGBM Model not being saved correctly #2517

Closed

Sanchita-P opened this issue Oct 18, 2019 · 5 comments

Comments

Sanchita-P commented Oct 18, 2019

I am saving my model by doing this:
best_gbm.save_model('best_gbm_raw_v2.1.txt', num_iteration=num_boost_round)
However, when I go through the txt file, I see that the num_iterations parameter is 100, which is the default value. This is incorrect, since I am explicitly passing num_boost_round while saving and it is not 100. Apart from this, all the other parameters are saved correctly. What could be causing this?
In case you'd like to see the training line:
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round) (best doesn't have num_iterations as a parameter)

Additionally, another quick question: initially, when I passed exactly the same parameters to the native LightGBM API and the sklearn API, I got exactly the same results. However, after making a few changes to the code, e.g. passing num_boost_round instead of a fixed number of iterations, the results from the LightGBM and sklearn APIs now differ significantly (feature importance and performance metrics). I can't figure out what's going wrong.

Here are the lines I am using to train both models:

##Native LightGBM
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

##Sklearn's LightGBM
best_sk = dict(best)
# the sklearn wrapper takes min_split_gain as a keyword, so drop the core alias from the dict
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round, learning_rate=0.05, min_split_gain=best['min_gain_to_split'])
sk_best_gbm.fit(x_train, y_train)

PS: No matter how many times I run each model individually, I get exactly the same results for it.
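
(Editor's sketch, not part of the original report: assuming the x_test split and the two fitted models from the full code further down in this thread, one way to quantify the discrepancy is to compare the predicted probabilities and the split-based feature importances directly.)

import numpy as np

# the native Booster returns P(y=1) for a binary objective;
# the sklearn wrapper exposes the same quantity via predict_proba
native_pred = best_gbm.predict(x_test)
sk_pred = sk_best_gbm.predict_proba(x_test)[:, 1]
print(np.abs(native_pred - sk_pred).max())

# both importances default to split counts, so they are directly comparable
print(best_gbm.feature_importance())
print(sk_best_gbm.feature_importances_)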

Sanchita-P added the bug label Oct 18, 2019
guolinke (Collaborator)

For the parameter logging error, refer to #2208; it is a known issue, but it doesn't affect usage.

For the second question, I don't understand it clearly. Do you mean that removing min_gain_to_split from the dict but passing it as a function argument gives a different result?

Sanchita-P (Author)

@guolinke Thank you for your response!
For the parameter logging issue, do you mean that the log will contain an incorrect number, but that it will not affect any results after loading the saved model?

When I pass exactly the same parameters to the native LightGBM API and the sklearn API, I get different results. However, no matter how many times I run each model individually, I get exactly the same results for it. Additionally, the native LightGBM API and the sklearn API initially gave the same results, but after minor changes in parameters and syntax they no longer match. Find the code below for reference:

import csv
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK

x_train, x_test, y_train, y_test = train_test_split(df_dummy[df_merge.columns], labels, test_size=0.25, random_state=42)

n_folds = 5

lgb_train = lgb.Dataset(x_train, y_train)

def objective(params, n_folds = n_folds):
    """Objective function for Gradient Boosting Machine Hyperparameter Tuning"""
    
    params['objective'] = 'binary'
    print(params)

    # hyperopt's quniform returns floats; LightGBM expects integers for these
    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])

    params['min_child_samples'] = int(params['min_child_samples'])
    params['subsample_freq'] = int(params['subsample_freq'])

    # Perform n_fold cross validation with hyperparameters

    # Use early stopping and evaluate based on ROC AUC
    cv_results = lgb.cv(params, lgb_train, nfold=n_folds, num_boost_round=10000, 
                        early_stopping_rounds=100, metrics='auc')
  
    # Extract the best score
    best_score = max(cv_results['auc-mean'])
    
    # Loss must be minimized
    loss = 1 - best_score
    num_iteration = int(np.argmax(cv_results['auc-mean']) + 1)
    
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, num_iteration])
    of_connection.close()
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'estimators': num_iteration}

space = {
    'min_child_samples': hp.quniform('min_child_samples', 5, 100, 5), 
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.0),
    'max_depth' : hp.quniform('max_depth', 3, 10, 1),
    'subsample' : hp.quniform('subsample', 0.6, 1, 0.05),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),  
    'subsample_freq': hp.quniform('subsample_freq',0,10,1),
    'min_gain_to_split': hp.quniform('min_gain_to_split', 0.01, 0.1, 0.01),

    'learning_rate' : hp.quniform('learning_rate', 0.05, 0.05, 0.05)                               
}

out_file = 'results/gbm_trials.csv'
of_connection = open(out_file, 'w')
writer = csv.writer(of_connection)

writer.writerow(['loss', 'params', 'estimators'])
of_connection.close()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, trials=trials, max_evals=10)

#bayes_trials_results = sorted(trials.results, key = lambda x: x['loss'])
#bayes_trials_results[0]

results = pd.read_csv('results/gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
results.head()
best_bayes_estimators = int(results.loc[0, 'estimators'])

# cast hyperopt's float outputs back to integers and add the fixed parameters
best['max_depth'] = int(best['max_depth'])
best['num_leaves'] = int(best['num_leaves'])
best['min_child_samples'] = int(best['min_child_samples'])
num_boost_round=int(best_bayes_estimators * 1.1)
best['objective'] = 'binary'
best['boosting_type'] = 'gbdt'
best['subsample_freq'] = int(best['subsample_freq'])

np.random.RandomState(42)  # note: creates an unused RandomState object; it does not seed NumPy's global state or LightGBM
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

#TESTING ON SKLEARN 

best_sk = dict(best)
del best_sk['min_gain_to_split']
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round, min_split_gain=best['min_gain_to_split'])
np.random.RandomState(42)  # same note as above: this call has no effect on the fit below
sk_best_gbm.fit(x_train, y_train)

sk_best_gbm.get_params()

guolinke (Collaborator) commented Oct 18, 2019

@Sanchita-P yeah, those logged parameters are just a record of what was used in the experiment; they are not read back when the model is loaded.
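
(Editor's sketch, using the file name from the original post: a quick way to confirm this is to load the saved file and check how many trees the booster actually contains; that count should match num_boost_round even though the header still logs the default num_iterations=100.)

import lightgbm as lgb

# load the model that was saved with best_gbm.save_model(...)
booster = lgb.Booster(model_file='best_gbm_raw_v2.1.txt')

# the tree count reflects what was actually saved; it should equal
# num_boost_round even though the parameter section of the file
# still prints num_iterations=100
print(booster.num_trees())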

For the parameter consistency problem, you can try to use a new lgb_train before lgb.train, like:

lgb_train = lgb.Dataset(x_train, y_train)
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)

lgb_train is lazily initialized and only constructed once, so in your script it gets constructed inside the cv calls. Some parameters used there (like min_child_samples) can affect how lgb_train is constructed, so it may end up initialized with different parameters than the ones you pass to the final lgb.train. (So it is better to use a fresh lgb_train in the cv part as well; see the sketch below.)
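
(Editor's sketch of that pattern; it reuses x_train, y_train, n_folds, STATUS_OK, best and num_boost_round from the code above and omits the CSV bookkeeping. The idea is to build a fresh Dataset inside the objective for each cv call and another one right before the final training, so neither call reuses a Dataset that was constructed with different parameters.)

def objective(params, n_folds=n_folds):
    # parameter casting omitted; same as in the original objective
    # fresh Dataset for this trial, so it is constructed with this trial's parameters
    cv_data = lgb.Dataset(x_train, y_train)
    cv_results = lgb.cv(params, cv_data, nfold=n_folds, num_boost_round=10000,
                        early_stopping_rounds=100, metrics='auc')
    best_score = max(cv_results['auc-mean'])
    return {'loss': 1 - best_score, 'status': STATUS_OK}

# ... run hyperopt as before, then rebuild the Dataset once more for the final fit
lgb_train = lgb.Dataset(x_train, y_train)
best_gbm = lgb.train(params=best, train_set=lgb_train, num_boost_round=num_boost_round)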

As for the randomness, you can set the seed parameter to control it.
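
(Editor's sketch of how such a seed could be passed to both APIs; seed and random_state are documented LightGBM parameter names, and the other variables come from the code above.)

# native API: pass the seed inside the params dict
best['seed'] = 42
best_gbm = lgb.train(params=best, train_set=lgb.Dataset(x_train, y_train),
                     num_boost_round=num_boost_round)

# sklearn API: the equivalent keyword argument is random_state
sk_best_gbm = lgb.LGBMClassifier(**best_sk, n_estimators=num_boost_round,
                                 min_split_gain=best['min_gain_to_split'],
                                 random_state=42)
sk_best_gbm.fit(x_train, y_train)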

Sanchita-P (Author)

@guolinke Using a new lgb_train before lgb.train worked! Thank you so much, I have spent the last 15 hours trying to figure it out and was unable to find the solution. You're a savior :)

guolinke (Collaborator)

I think this could be closed.
