
Path smoothing #2950

Merged
merged 14 commits into from
May 3, 2020

Conversation

@btrotta (Collaborator) commented Mar 27, 2020

This implements the path smoothing idea proposed by @MaxHalford in #2790.

I implemented a slightly simpler version which does not use leaf depth. For each node (except the root), the output is calculated as `((n / s) * original_output + parent_output) / (n / s + 1)`, where `n` is the number of samples in the leaf, `s` is a regularisation parameter (larger `s` means more smoothing), `original_output` is the unsmoothed output, and `parent_output` is the output of the parent node (which has itself already been smoothed).

The reasoning behind this is that it's similar to the Bayesian calculation of the posterior mean given a prior expectation (see, e.g., Section 2 here: http://www.ams.sunysb.edu/~zhu/ams570/Bayesian_Normal.pdf). The posterior estimate of the true mean (in our case, the leaf output) is a weighted average of the sample mean and the prior mean (the parent output), with weights determined by the sample size and by the sample and prior variances. (This is more of an analogy than a real model, since the leaf output isn't actually the mean of a sample.) The parameter s represents the expected ratio of child node variance to parent node variance; small s means we expect the ratio to be small, i.e. the data is not noisy and the model can explain a lot of the variance, so we don't need much regularisation.
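
For concreteness, a minimal standalone sketch of the per-node calculation described above (an illustration only, not the actual LightGBM implementation; the function and variable names are made up):

#include <cstdio>

// Smoothing rule from above: n = number of samples in the node, s = the
// regularisation parameter, raw = the unsmoothed output, and parent = the
// (already smoothed) output of the parent node.
double SmoothedOutput(double raw, double parent, double n, double s) {
  if (s <= 0.0) return raw;  // s == 0 disables smoothing
  const double k = n / s;
  return (k * raw + parent) / (k + 1.0);
}

int main() {
  // A leaf with 4 samples, raw output 1.0, parent output 0.2: with s = 4,
  // the weight k = n / s = 1, so the result is the midpoint, 0.6.
  std::printf("%f\n", SmoothedOutput(1.0, 0.2, 4.0, 4.0));
  return 0;
}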

I haven't tested this much on real datasets. Based on the couple that I tried, it definitely improves regularisation, but doesn't seem to be dramatically different to existing regularisation methods. Given that it requires non-trivial code changes, it may not be worth the maintenance effort. Feel free to close this PR if so.

@btrotta (Collaborator, Author) commented Mar 27, 2020

Below is a basic script to test on the Boston housing data, along with the results. The script tries several different regularisation parameters (individually, not in combination), selects the best value for each, and then plots the results.

import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn import datasets

# load_boston returns (X, y) with return_X_y=True (available in scikit-learn < 1.2)
X, y = datasets.load_boston(return_X_y=True)
# put the label in column 0 so that train_cols below selects only the features
df = pd.DataFrame(np.concatenate([y[:, np.newaxis], X], axis=1))
train_cols = df.columns[1:]
np.random.seed(0)
train_ind = np.random.choice(df.index, len(df) // 2, replace=False)
train_bool = df.index.isin(train_ind)
lgb_train = lgb.Dataset(df.loc[train_bool, train_cols], label=df.loc[train_bool, 0])
lgb_test = lgb.Dataset(df.loc[~train_bool, train_cols], label=df.loc[~train_bool, 0])
valid_sets = [lgb_train, lgb_test]
valid_names = ['train', 'valid']

# base
params = {'objective': 'regression_l2', 'seed': 0, 'num_leaves': 32, 'learning_rate': 0.01, 'metric': 'rmse',
          'lambda_l1': 0, 'lambda_l2': 0, 'min_data_in_leaf': 2, 'path_smooth': 0, 'subsample': 0.8,
          'feature_pre_filter': False}
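# note: subsample (bagging_fraction) only takes effect when subsample_freq
# (bagging_freq) is nonzero; as written above, bagging is inactive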

# find best regularisation parameter
reg_param_names = ['min_data_in_leaf', 'path_smooth', 'lambda_l1', 'lambda_l2']
best_value = {p: None for p in reg_param_names}
best_res = {p: None for p in reg_param_names}
param_range = {p: [0, 1, 2, 4, 8, 16] for p in reg_param_names}
param_range['min_data_in_leaf'] = [2, 4, 8, 16]
for reg_param in reg_param_names:
    best_test_loss = np.inf
    for param_value in param_range[reg_param]:
        params[reg_param] = param_value
        res = {}
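        # note: the evals_result argument was removed in lightgbm 4.0;
        # on newer versions use callbacks=[lgb.record_evaluation(res)] instead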
        est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names,
                        num_boost_round=100, evals_result=res)
        test_loss = res['valid']['rmse'][-1]
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            best_value[reg_param] = param_value
            best_res[reg_param] = res
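    # reset this parameter to its least-regularised value before tuning the next one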
    params[reg_param] = 0
    if reg_param == 'min_data_in_leaf':
        params[reg_param] = 2


plt.figure()
color_list = ['b', 'y', 'r', 'c', 'm']
legend_list = []
for i, reg_param in enumerate(reg_param_names):
    res = best_res[reg_param]
    plt.plot(res['train']['rmse'], color=color_list[i])
    plt.plot(res['valid']['rmse'], color=color_list[i], linestyle=':')
    legend_list.append('{} = {} train'.format(reg_param, best_value[reg_param]))
    legend_list.append('{} = {} test'.format(reg_param, best_value[reg_param]))

# no regularisation
res = {}
params = {'objective': 'regression_l2', 'seed': 0, 'num_leaves': 32, 'learning_rate': 0.01, 'metric': 'rmse',
          'lambda_l1': 0, 'lambda_l2': 0, 'min_data_in_leaf': 2, 'path_smooth': 0, 'subsample': 0.8,
          'feature_pre_filter': False}
est = lgb.train(params, lgb_train, valid_sets=valid_sets, valid_names=valid_names,
                num_boost_round=100, evals_result=res)
i += 1
plt.plot(res['train']['rmse'], color=color_list[i])
plt.plot(res['valid']['rmse'], color=color_list[i], linestyle=':')
legend_list.append('no reg train')
legend_list.append('no reg test')
plt.legend(legend_list)

[Figure 1: training (solid) and validation (dotted) RMSE curves for each regularisation parameter at its best value, plus the unregularised baseline]

@btrotta (Collaborator, Author) commented Mar 28, 2020

I think I fixed the problem with the GPU version, but CI is still failing because of a problem with the R package. Not sure if this is due to my code changes or not...

@jameslamb (Collaborator)

> I think I fixed the problem with the GPU version, but CI is still failing because of a problem with the R package. Not sure if this is due to my code changes or not...

@btrotta If you rebase to master to get the changes from #2954, the R test issues and intermittent linting issues should be resolved.

Sorry for the inconvenience!

@StrikerRUS (Collaborator)

@jameslamb We need your help here:

* checking installed package size ... NOTE
  installed size is  5.1Mb
  sub-directories of 1Mb or more:
    libs   4.5Mb

@jameslamb (Collaborator)

> @jameslamb We need your help here:
>
> * checking installed package size ... NOTE
>   installed size is  5.1Mb
>   sub-directories of 1Mb or more:
>     libs   4.5Mb

ah! I checked the diff, and this NOTE is not a problem for now. It mainly exists to keep hosting costs down for CRAN, so that people don't (for example) check large dataset files into their packages. But the check is interesting because it runs on the installed package: in our case libs/ doesn't even exist in our source tree but is created at install time to hold lib_lightgbm.so / lib_lightgbm.dll.

I want to investigate this more, but it's not a problem caused by this PR. The NOTE is triggered by an installed package over 5.0 MB, and I think we were juuuust under that on Mac and Linux before. This is one of those weird NOTEs that CRAN will sometimes ignore if you explain it to them.

For now @btrotta , just change this line to 4 allowed notes: https://github.com/microsoft/LightGBM/blob/master/.ci/test_r_package.sh#L94.

This doesn't need to block this PR, and I can address it with a more long-term answer in a separate PR. For what it's worth, I see it on #2936 as well.

@btrotta (Collaborator, Author) commented Mar 30, 2020

@jameslamb thanks, that fixed it! It's now failing because of a broken link in the docs (not related to this PR).

@jameslamb (Collaborator)

> @jameslamb thanks, that fixed it! It's now failing because of a broken link in the docs (not related to this PR).

Ha, sorry, we have a few annoying things happening in our CI this week. The failing check-docs task is being fixed in #2956. Thanks for being patient with us.

@jameslamb (Collaborator)

> Ha, sorry, we have a few annoying things happening in our CI this week. The failing check-docs task is being fixed in #2956. Thanks for being patient with us.

@btrotta as soon as I wrote that, I went to review that PR and saw that it could be merged! If you rebase to master again to get those changes, your PR should pass.

// desc = if `path_smooth > 0` then `min_data_in_leaf` must be at least `2`.
// desc = larger values give stronger regularisation
// descl2 = the weight of each node is `((n / path_smooth) * w + w_p) / (n / path_smooth + 1)`, where `n` is the number of samples in the node, `w` is the calculated node weight, and `w_p` is the (smoothed) weight of the parent node
double path_smooth = 0;
Collaborator:

more details about w and w_p?

Collaborator (Author):

Added some more explanation, hope it's clearer now.

Comment on lines 790 to 794
  if (USE_MAX_OUTPUT) {
    if (max_delta_step > 0 && std::fabs(ret) > max_delta_step) {
-     return Common::Sign(ret) * max_delta_step;
+     ret = Common::Sign(ret) * max_delta_step;
    }
  }
Collaborator:

it seems this could be moved outside too.

double gain_shift;
if (USE_SMOOTHING) {
  gain_shift = GetLeafGainGivenOutput<USE_L1>(
      sum_gradient, sum_hessian, meta_->config->lambda_l1, meta_->config->lambda_l2, parent_output);
Collaborator:

@btrotta could you double-check this? If it is correct, could you add some detailed comments to make it clear?

Collaborator (Author):

You're right, this was not quite correct. I've made a slight change; let me know if it's still unclear.

  }
  if (USE_SMOOTHING) {
    ret = ret * (num_data / smoothing) / (num_data / smoothing + 1) \
@guolinke (Collaborator) commented Apr 6, 2020:

maybe only apply this when `smoothing > kEpsilon`?

Collaborator (Author):

I have done the check for smoothing > kEpsilon outside this function (e.g. in FuncForNumricalL2 and FuncForCategoricalL1), and I only pass the template parameter USE_SMOOTHING=true if smoothing > kEpsilon. Alternatively, we could move the check inside CalculateSplittedLeafOutput, and then we wouldn't need the template parameter USE_SMOOTHING. That would be simpler, but we would lose the speedup that the template provides.

Reviewer:

USE_SMOOTHING may be unnecessary for this function. The smoothed output can be simplified as below: the weight for the current output is w = n / (n + s), and ret = w * ret + (1 - w) * parent_out. n must be greater than zero; s = 0 means no path smoothing.
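
(A small standalone sketch, with made-up names, checking that this simplified form matches the ((n / s) * ret + parent_out) / (n / s + 1) form used in the PR; multiplying the numerator and denominator by s turns one into the other:)

#include <cassert>
#include <cmath>
#include <initializer_list>

// Form used in the PR: ((n / s) * ret + parent) / (n / s + 1)
double SmoothA(double ret, double parent, double n, double s) {
  const double k = n / s;
  return (k * ret + parent) / (k + 1.0);
}

// Simplified form suggested above: w = n / (n + s), then w * ret + (1 - w) * parent
double SmoothB(double ret, double parent, double n, double s) {
  const double w = n / (n + s);
  return w * ret + (1.0 - w) * parent;
}

int main() {
  // the two forms agree for any n > 0 and s > 0
  for (double n : {1.0, 4.0, 100.0}) {
    for (double s : {0.5, 2.0, 16.0}) {
      assert(std::fabs(SmoothA(0.7, -0.3, n, s) - SmoothB(0.7, -0.3, n, s)) < 1e-12);
    }
  }
  return 0;
}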

      meta_->config->max_delta_step);
  double current_gain;
  bool use_smoothing = meta_->config->path_smooth > kEpsilon;
  if (use_smoothing) {
Collaborator:

for these GatherInfoxxx functions, maybe you can use GetLeafGain<true, true, true> and check `path_smooth > kEpsilon` inside?

Collaborator (Author):

See comment above.

@guolinke (Collaborator) commented Apr 9, 2020:

Okay, maybe we can make the GatherInfoxxx functions into templates too. For example, template<bool USE_SMOOTHING> GatherInfoForThresholdNumericalInner, and check use_smoothing in GatherInfoForThresholdNumerical. This way, we can reduce these if ... else branches.

Collaborator (Author):

Changed as suggested.

@guolinke (Collaborator) commented Apr 6, 2020

Thanks @btrotta!
I feel that the template expansions in FeatureHistogram that I introduced are hard to develop against and review.
I will do another refactoring of them when I have time.

      meta_->config->max_delta_step);
  double current_gain;
  bool use_smoothing = meta_->config->path_smooth > kEpsilon;
  if (use_smoothing) {
Reviewer:

there are a lot of if (use_smoothing) blocks in the code:

if (use_smoothing) {
    func<true, true, true>();
} else {
    func<true, true, false>();
}

it would be better to use func<true, true, use_smoothing>(); instead

Collaborator (Author):

This gives a compiler error because the value of use_smoothing isn't known at compile time.
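
(For illustration, a minimal standalone sketch, with made-up names, of why the explicit branch is needed: template arguments must be compile-time constants, so each branch has to instantiate its own specialization.)

#include <iostream>

template <bool USE_L1, bool USE_MAX_OUTPUT, bool USE_SMOOTHING>
void Func() {
  std::cout << "smoothing: " << USE_SMOOTHING << '\n';
}

void Dispatch(bool use_smoothing) {
  // Func<true, true, use_smoothing>() would not compile: use_smoothing is a
  // runtime value, and template arguments must be constant expressions.
  if (use_smoothing) {
    Func<true, true, true>();
  } else {
    Func<true, true, false>();
  }
}

int main() {
  Dispatch(true);   // prints "smoothing: 1"
  Dispatch(false);  // prints "smoothing: 0"
  return 0;
}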

double output_without_split = CalculateSplittedLeafOutput<USE_L1, USE_MAX_OUTPUT, USE_SMOOTHING>(
    sum_gradient, sum_hessian, meta_->config->lambda_l1, meta_->config->lambda_l2,
    meta_->config->max_delta_step, meta_->config->path_smooth, num_data, parent_output);
double gain_shift = GetLeafGainGivenOutput<USE_L1>(
Collaborator:

Maybe add a new function that calls CalculateSplittedLeafOutput and then GetLeafGainGivenOutput? It seems some other places also use this pattern.

Collaborator (Author):

Actually I realised we can just use the existing function GetLeafGain here.

@StrikerRUS (Collaborator)

@jameslamb please take a look

Linting R code
Error: 'absolute_path_linter' is not an exported object from 'namespace:lintr'
Execution halted

@StrikerRUS (Collaborator)

@guolinke Can you please check your DM in Slack?

@StrikerRUS (Collaborator)

@jameslamb Created #2986 to not pollute this PR.

@guolinke (Collaborator)

@StrikerRUS Thanks!
@btrotta Thanks very much for your many contributions, would you like to join the LightGBM project as a collaborator?

@btrotta (Collaborator, Author) commented Apr 11, 2020

@guolinke Yes that would be great! Thanks!

@guolinke (Collaborator)

Is this ready to merge?

@btrotta (Collaborator, Author) commented Apr 26, 2020

@guolinke The CI is still failing, I'm not sure how to fix it.

@jameslamb (Collaborator)

> @guolinke The CI is still failing, I'm not sure how to fix it.

@btrotta can you please merge master into this PR? We have had several PRs to fix CI issues in the last few weeks.

@btrotta (Collaborator, Author) commented Apr 26, 2020

@jameslamb Thanks, that has fixed the problem from before. I also fixed some linting errors. But now I'm getting a new failure which wasn't happening previously:

.ci/test.sh: line 119: pytest: command not found
The command "bash .ci/test.sh" exited with 255.

@StrikerRUS (Collaborator)

> But now I'm getting a new failure which wasn't happening previously:

It was a random network connection issue during conda installation:

('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))

I re-ran that job and now everything is green.

@StrikerRUS (Collaborator) left a review comment:

Please fix some formatting errors (formatting is missing in these places):

[screenshot of the formatting errors]

@@ -98,7 +98,7 @@ if grep -q -R "WARNING" "$LOG_FILE_NAME"; then
     exit -1
 fi

-ALLOWED_CHECK_NOTES=2
+ALLOWED_CHECK_NOTES=4
Collaborator:

@jameslamb Should be 3, right?

Collaborator:

Yes, @btrotta can you please change it to 3?

A lot has changed in our CI since this PR was first opened, sorry 😬

Collaborator:

Suggested change:
-ALLOWED_CHECK_NOTES=4
+ALLOWED_CHECK_NOTES=3

Collaborator:

@jameslamb Is a similar increment needed for Windows?

$ALLOWED_CHECK_NOTES = 3

Collaborator (Author):

No worries, fixed now.

Collaborator:

ping @jameslamb for Windows updates

Collaborator:

@StrikerRUS nothing is required on this PR since the Windows R tests are passing. We have it set to 3 and are generating exactly 3:

[screenshot of the Windows CI output showing exactly 3 NOTEs]

Collaborator:

@jameslamb OK. I was just afraid that we could hit #2950 (comment) on Windows if, say, the current size is 4.99: during some runs it could be calculated as 5, like in #2988.

Collaborator:

If we do, we can bump it back up; it wouldn't bother me. I just don't think this PR has to be the place where we deal with it. If CI is passing for the R package right now, nothing should be changed.

Resolved review threads (outdated): docs/Parameters-Tuning.rst, include/LightGBM/config.h (three threads), src/treelearner/feature_histogram.hpp
@btrotta (Collaborator, Author) commented Apr 27, 2020

@StrikerRUS thanks for your comments, I've fixed those issues now.

@StrikerRUS (Collaborator) left a review comment:

> @btrotta: thanks for your comments, I've fixed those issues now.

Many thanks! Please address one more comment about an inconsistency in the naming of the new parameter.

Resolved review threads (outdated): docs/Parameters-Tuning.rst, docs/Parameters.rst, include/LightGBM/config.h
@jameslamb (Collaborator) left a review comment:

Looks great to me, thanks @btrotta! I'm going to leave just a Comment review... my review shouldn't count towards a merge.

@guolinke merged commit e50a915 into microsoft:master on May 3, 2020