
Path smoothing #2950

Merged
merged 14 commits into from
May 3, 2020

Conversation

@btrotta (Collaborator) commented Mar 27, 2020

This implements the path smoothing idea proposed by @MaxHalford in #2790.

I implemented a slightly simpler version which does not use leaf depth. For each node (except the root), the output is calculated as `((n / s) * original_output + parent_output) / (n / s + 1)`, where `n` is the number of samples in the leaf, `s` is a regularisation parameter (larger `s` means more smoothing), `original_output` is the unsmoothed output, and `parent_output` is the output of the parent node (which has itself already been smoothed).

The reasoning behind this is that it's similar to the Bayesian calculation of the posterior mean given a prior expectation (see, e.g., Section 2 here: http://www.ams.sunysb.edu/~zhu/ams570/Bayesian_Normal.pdf). The posterior estimate of the true mean (in our case, the leaf output) is a weighted average of the sample mean and the prior mean (the parent output), with weights determined by the sample size and by the sample and prior variances. (This is more of an analogy than a real model, since the leaf output isn't actually the mean of a sample.) The parameter s represents the expected ratio of child node variance to parent node variance; small s means we expect the ratio to be small, i.e. the data is not noisy and the model can explain a lot of the variance, so we don't need much regularisation.
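
For concreteness, a minimal standalone sketch of the per-node calculation described above (an illustration only, not the actual LightGBM implementation; the function and variable names are made up):

#include <cstdio>

// Smoothing rule from above: n = number of samples in the node, s = the
// regularisation parameter, raw = the unsmoothed output, and parent = the
// (already smoothed) output of the parent node.
double SmoothedOutput(double raw, double parent, double n, double s) {
  if (s <= 0.0) return raw;  // s == 0 disables smoothing
  const double k = n / s;
  return (k * raw + parent) / (k + 1.0);
}

int main() {
  // A leaf with 4 samples, raw output 1.0, parent output 0.2: with s = 4,
  // the weight k = n / s = 1, so the result is the midpoint, 0.6.
  std::printf("%f\n", SmoothedOutput(1.0, 0.2, 4.0, 4.0));
  return 0;
}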

I haven't tested this much on real datasets. Based on the couple that I tried, it definitely improves regularisation, but doesn't seem to be dramatically different to existing regularisation methods. Given that it requires non-trivial code changes, it may not be worth the maintenance effort. Feel free to close this PR if so.

@btrotta (Collaborator, Author) commented Mar 27, 2020

Below is a basic script to test on the Boston housing data, along with the results. The script tries several different regularisation parameters (individually, not in combination), selects the best value for each, and then plots the results.

import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn import datasets

# load_boston returns (X, y) with return_X_y=True (available in scikit-learn < 1.2)
X, y = datasets.load_boston(return_X_y=True)
# put the label in column 0 so that train_cols below selects only the features
df = pd.DataFrame(np.concatenate([y[:, np.newaxis], X], axis=1))
train_cols = df.columns[1:]
np.random.seed(0)
train_ind = np.random.choice(df.index, len(df) // 2, replace=False)
train_bool = df.index.isin(train_ind)
lgb_train = lgb.Dataset(df.loc[train_bool, train_cols], label=df.loc[train_bool, 0])
lgb_test = lgb.Dataset(df.loc[~train_bool, train_cols], label=df.loc[~train_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']

# base
params = {'objective': 'regression_l2', 'seed': 0, 'num_leaves': 32, 'learning_rate': 0.01, 'metric': 'rmse',
          'lambda_l1': 0, 'lambda_l2': 0, 'min_data_in_leaf': 2, 'path_smooth': 0, 'subsample': 0.8,
          'feature_pre_filter': False}
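# note: subsample (bagging_fraction) only takes effect when subsample_freq
# (bagging_freq) is nonzero; as written above, bagging is inactive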

# find best regularisation parameter
reg_param_names = ['min_data_in_leaf', 'path_smooth', 'lambda_l1', 'lambda_l2']
best_value = {p: None for p in reg_param_names}
best_res = {p: None for p in reg_param_names}
param_range = {p: [0, 1, 2, 4, 8, 16] for p in reg_param_names}
param_range['min_data_in_leaf'] = [2, 4, 8, 16]
for reg_param in reg_param_names:
    best_test_loss = np.inf
    for param_value in param_range[reg_param]:
        params[reg_param] = param_value
        res = {}
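        # note: the evals_result argument was removed in lightgbm 4.0;
        # on newer versions use callbacks=[lgb.record_evaluation(res)] instead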
        est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names,
                        num_boost_round=100, evals_result=res)
        test_loss = res['valid']['rmse'][-1]
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            best_value[reg_param] = param_value
            best_res[reg_param] = res
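    # reset this parameter to its least-regularised value before tuning the next one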
    params[reg_param] = 0
    if reg_param == 'min_data_in_leaf':
        params[reg_param] = 2


plt.figure()
color_list = ['b', 'y', 'r', 'c', 'm']
legend_list = []
for i, reg_param in enumerate(reg_param_names):
    res = best_res[reg_param]
    plt.plot(res['train']['rmse'], color=color_list[i])
    plt.plot(res['valid']['rmse'], color=color_list[i], linestyle=':')
    legend_list.append('{} = {} train'.format(reg_param, best_value[reg_param]))
    legend_list.append('{} = {} test'.format(reg_param, best_value[reg_param]))

# no regularisation
res = {}
params = {'objective': 'regression_l2', 'seed': 0, 'num_leaves': 32, 'learning_rate': 0.01, 'metric': 'rmse',
          'lambda_l1': 0, 'lambda_l2': 0, 'min_data_in_leaf': 2, 'path_smooth': 0, 'subsample': 0.8,
          'feature_pre_filter': False}
est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names,
                num_boost_round=100, evals_result=res)
i += 1
plt.plot(res['train']['rmse'], color=color_list[i])
plt.plot(res['valid']['rmse'], color=color_list[i], linestyle=':')
legend_list.append('no reg train')
legend_list.append('no reg test')
plt.legend(legend_list)

[Figure 1: training (solid) and validation (dotted) RMSE curves for each regularisation parameter at its best value, plus the unregularised baseline]

@btrotta (Collaborator, Author) commented Mar 28, 2020

I think I fixed the problem with the GPU version, but CI is still failing because of a problem with the R package. Not sure if this is due to my code changes or not...

@jameslamb (Collaborator)

> I think I fixed the problem with the GPU version, but CI is still failing because of a problem with the R package. Not sure if this is due to my code changes or not...

@btrotta If you rebase to master to get the changes from #2954, the R test issues and intermittent linting issues should be resolved.

Sorry for the inconvenience!

@StrikerRUS (Collaborator)

@jameslamb We need your help here:

* checking installed package size ... NOTE
  installed size is  5.1Mb
  sub-directories of 1Mb or more:
    libs   4.5Mb

@jameslamb (Collaborator)

> @jameslamb We need your help here:
>
> * checking installed package size ... NOTE
>   installed size is  5.1Mb
>   sub-directories of 1Mb or more:
>     libs   4.5Mb

ah! I checked the diff, and this NOTE is not a problem for now. It mainly exists to keep hosting costs down for CRAN, so that people don't (for example) check large dataset files into their packages. But the check is interesting because it runs on the installed package: in our case libs/ doesn't even exist in our source tree but is created at install time to hold lib_lightgbm.so / lib_lightgbm.dll.

I want to investigate this more, but it's not a problem caused by this PR. The NOTE is triggered by an installed package over 5.0 MB, and I think we were juuuust under that on Mac and Linux before. This is one of those weird NOTEs that CRAN will sometimes ignore if you explain it to them.

For now @btrotta , just change this line to 4 allowed notes: https://github.com/microsoft/LightGBM/blob/master/.ci/test_r_package.sh#L94.

This doesn't need to block this PR, and I can address it with a more long-term answer in a separate PR. For what it's worth, I see it on #2936 as well.

@btrotta (Collaborator, Author) commented Mar 30, 2020

@jameslamb thanks, that fixed it! It's now failing because of a broken link in the docs (not related to this PR).

@jameslamb (Collaborator)

> @jameslamb thanks, that fixed it! It's now failing because of a broken link in the docs (not related to this PR).

Ha, sorry, we have a few annoying things happening in our CI this week. The failing check-docs task is being fixed in #2956. Thanks for being patient with us.

@jameslamb (Collaborator)

> Ha, sorry, we have a few annoying things happening in our CI this week. The failing check-docs task is being fixed in #2956. Thanks for being patient with us.

@btrotta as soon as I wrote that, I went to review that PR and saw that it could be merged! If you rebase to master again to get those changes, your PR should pass.

// desc = if `path_smooth > 0` then `min_data_in_leaf` must be at least `2`.
// desc = larger values give stronger regularisation
// descl2 = the weight of each node is `((n / path_smooth) * w + w_p) / (n / path_smooth + 1)`, where `n` is the number of samples in the node, `w` is the calculated node weight, and `w_p` is the (smoothed) weight of the parent node
double path_smooth = 0;
Collaborator:

more details about w and w_p?

Collaborator (Author):

Added some more explanation, hope it's clearer now.

Comment on lines 790 to 794
  if (USE_MAX_OUTPUT) {
    if (max_delta_step > 0 && std::fabs(ret) > max_delta_step) {
-     return Common::Sign(ret) * max_delta_step;
+     ret = Common::Sign(ret) * max_delta_step;
    }
  }
Collaborator:

it seems this could be moved outside too.

double gain_shift;
if (USE_SMOOTHING) {
  gain_shift = GetLeafGainGivenOutput<USE_L1>(
      sum_gradient, sum_hessian, meta_->config->lambda_l1, meta_->config->lambda_l2, parent_output);
Collaborator:

@btrotta could you double-check this? If it is correct, could you add some detailed comments to make it clear?

Collaborator (Author):

You're right, this was not quite correct. I've made a slight change; let me know if it's still unclear.

  }
  if (USE_SMOOTHING) {
    ret = ret * (num_data / smoothing) / (num_data / smoothing + 1) \
@guolinke (Collaborator) commented Apr 6, 2020:

maybe only apply this when `smoothing > kEpsilon`?

Collaborator (Author):

I have done the check for smoothing > kEpsilon outside this function (e.g. in FuncForNumricalL2 and FuncForCategoricalL1), and I only pass the template parameter USE_SMOOTHING=true if smoothing > kEpsilon. Alternatively, we could move the check inside CalculateSplittedLeafOutput, and then we wouldn't need the template parameter USE_SMOOTHING. That would be simpler, but we would lose the speedup that the template provides.

Reviewer:

USE_SMOOTHING may be unnecessary for this function. The smoothed output can be simplified as below: the weight for the current output is w = n / (n + s), and ret = w * ret + (1 - w) * parent_out. n must be greater than zero; s = 0 means no path smoothing.
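
(A small standalone sketch, with made-up names, checking that this simplified form matches the ((n / s) * ret + parent_out) / (n / s + 1) form used in the PR; multiplying the numerator and denominator by s turns one into the other:)

#include <cassert>
#include <cmath>
#include <initializer_list>

// Form used in the PR: ((n / s) * ret + parent) / (n / s + 1)
double SmoothA(double ret, double parent, double n, double s) {
  const double k = n / s;
  return (k * ret + parent) / (k + 1.0);
}

// Simplified form suggested above: w = n / (n + s), then w * ret + (1 - w) * parent
double SmoothB(double ret, double parent, double n, double s) {
  const double w = n / (n + s);
  return w * ret + (1.0 - w) * parent;
}

int main() {
  // the two forms agree for any n > 0 and s > 0
  for (double n : {1.0, 4.0, 100.0}) {
    for (double s : {0.5, 2.0, 16.0}) {
      assert(std::fabs(SmoothA(0.7, -0.3, n, s) - SmoothB(0.7, -0.3, n, s)) < 1e-12);
    }
  }
  return 0;
}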

      meta_->config->max_delta_step);
  double current_gain;
  bool use_smoothing = meta_->config->path_smooth > kEpsilon;
  if (use_smoothing) {
Collaborator:

for these GatherInfoxxx functions, maybe you can use GetLeafGain<true, true, true> and check `path_smooth > kEpsilon` inside?

Collaborator (Author):

See comment above.

@guolinke (Collaborator) commented Apr 9, 2020:

Okay, maybe we can make the GatherInfoxxx functions into templates too. For example, template<bool USE_SMOOTHING> GatherInfoForThresholdNumericalInner, and check use_smoothing in GatherInfoForThresholdNumerical. This way, we can reduce these if ... else branches.

Collaborator (Author):

Changed as suggested.

@guolinke (Collaborator) commented Apr 6, 2020

Thanks @btrotta!
I feel that the template expansions in FeatureHistogram that I introduced are hard to develop against and review.
I will do another refactoring of them when I have time.

      meta_->config->max_delta_step);
  double current_gain;
  bool use_smoothing = meta_->config->path_smooth > kEpsilon;
  if (use_smoothing) {
Reviewer:

there are a lot of if (use_smoothing) blocks in the code:

if (use_smoothing) {
    func<true, true, true>();
} else {
    func<true, true, false>();
}

it would be better to use func<true, true, use_smoothing>(); instead

Collaborator (Author):

This gives a compiler error because the value of use_smoothing isn't known at compile time.
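
(For illustration, a minimal standalone sketch, with made-up names, of why the explicit branch is needed: template arguments must be compile-time constants, so each branch has to instantiate its own specialization.)

#include <iostream>

template <bool USE_L1, bool USE_MAX_OUTPUT, bool USE_SMOOTHING>
void Func() {
  std::cout << "smoothing: " << USE_SMOOTHING << '\n';
}

void Dispatch(bool use_smoothing) {
  // Func<true, true, use_smoothing>() would not compile: use_smoothing is a
  // runtime value, and template arguments must be constant expressions.
  if (use_smoothing) {
    Func<true, true, true>();
  } else {
    Func<true, true, false>();
  }
}

int main() {
  Dispatch(true);   // prints "smoothing: 1"
  Dispatch(false);  // prints "smoothing: 0"
  return 0;
}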

double output_without_split = CalculateSplittedLeafOutput<USE_L1, USE_MAX_OUTPUT, USE_SMOOTHING>(
    sum_gradient, sum_hessian, meta_->config->lambda_l1, meta_->config->lambda_l2,
    meta_->config->max_delta_step, meta_->config->path_smooth, num_data, parent_output);
double gain_shift = GetLeafGainGivenOutput<USE_L1>(
Collaborator:

Maybe add a new function that calls CalculateSplittedLeafOutput and then GetLeafGainGivenOutput? It seems some other places also use this pattern.

Collaborator (Author):

Actually I realised we can just use the existing function GetLeafGain here.

@StrikerRUS (Collaborator)

@jameslamb please take a look

Linting R code
Error: 'absolute_path_linter' is not an exported object from 'namespace:lintr'
Execution halted

@StrikerRUS (Collaborator)

@guolinke Can you please check your DM in Slack?

@StrikerRUS (Collaborator)

@jameslamb Created #2986 to not pollute this PR.

@guolinke (Collaborator)

@StrikerRUS Thanks!
@btrotta Thanks very much for your many contributions, would you like to join the LightGBM project as a collaborator?

@btrotta (Collaborator, Author) commented Apr 11, 2020

@guolinke Yes that would be great! Thanks!

@guolinke (Collaborator)

Is this ready to merge?

@btrotta (Collaborator, Author) commented Apr 26, 2020

@guolinke The CI is still failing, I'm not sure how to fix it.

@jameslamb (Collaborator)

> @guolinke The CI is still failing, I'm not sure how to fix it.

@btrotta can you please merge master into this PR? We have had several PRs to fix CI issues in the last few weeks.

@btrotta (Collaborator, Author) commented Apr 26, 2020

@jameslamb Thanks, that has fixed the problem from before. I also fixed some linting errors. But now I'm getting a new failure which wasn't happening previously:

.ci/test.sh: line 119: pytest: command not found
The command "bash .ci/test.sh" exited with 255.

@StrikerRUS (Collaborator)

> But now I'm getting a new failure which wasn't happening previously:

It was a random network connection issue during conda installation:

('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))

I re-ran that job and now everything is green.

@StrikerRUS (Collaborator) left a review comment:

Please fix some formatting errors (formatting is missing in these places):

[screenshot of the formatting errors]

@@ -98,7 +98,7 @@ if grep -q -R "WARNING" "$LOG_FILE_NAME"; then
     exit -1
 fi

-ALLOWED_CHECK_NOTES=2
+ALLOWED_CHECK_NOTES=4
Collaborator:

@jameslamb Should be 3, right?

Collaborator:

Yes, @btrotta can you please change it to 3?

A lot has changed in our CI since this PR was first opened, sorry 😬

Collaborator:

Suggested change:
-ALLOWED_CHECK_NOTES=4
+ALLOWED_CHECK_NOTES=3

Collaborator:

@jameslamb Is a similar increment needed for Windows?

$ALLOWED_CHECK_NOTES = 3

Collaborator (Author):

No worries, fixed now.

Collaborator:

ping @jameslamb for Windows updates

Collaborator:

@StrikerRUS nothing is required on this PR since the Windows R tests are passing. We have it set to 3 and are generating exactly 3:

[screenshot of the Windows CI output showing exactly 3 NOTEs]

Collaborator:

@jameslamb OK. I was just afraid that we could hit #2950 (comment) on Windows if, say, the current size is 4.99: during some runs it could be calculated as 5, like in #2988.

Collaborator:

If we do, we can bump it back up; it wouldn't bother me. I just don't think this PR has to be the place where we deal with it. If CI is passing for the R package right now, nothing should be changed.

Resolved review threads (outdated): docs/Parameters-Tuning.rst, include/LightGBM/config.h (three threads), src/treelearner/feature_histogram.hpp
@btrotta (Collaborator, Author) commented Apr 27, 2020

@StrikerRUS thanks for your comments, I've fixed those issues now.

@StrikerRUS (Collaborator) left a review comment:

> @btrotta: thanks for your comments, I've fixed those issues now.

Many thanks! Please address one more comment about an inconsistency in the naming of the new parameter.

Resolved review threads (outdated): docs/Parameters-Tuning.rst, docs/Parameters.rst, include/LightGBM/config.h
@jameslamb (Collaborator) left a review comment:

Looks great to me, thanks @btrotta! I'm going to leave just a Comment review... my review shouldn't count towards a merge.

@guolinke merged commit e50a915 into microsoft:master on May 3, 2020