Shuffling feature order when changing seed #246

Closed
bkamins opened this issue Aug 11, 2023 · 5 comments
@bkamins

bkamins commented Aug 11, 2023

Another issue following https://bkamins.github.io/julialang/2023/08/11/evotrees.html.

I have not checked your code in detail here, but it seems that when the seed of the rng is changed you do not shuffle the order of features during training, and that the way the algorithm is implemented makes it sensitive to this order.

This is a minor issue, but in some cases (such as variable importance) it might matter.

(I might be wrong here, as I have not studied your source code carefully enough.)
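
For concreteness, here is a minimal sketch of how this order sensitivity could be checked (a sketch only, assuming the df with feature columns x1..x9 and target y from the blog post, and the fit_evotree / EvoTrees.importance calls used later in this thread):

using DataFrames, EvoTrees

feat_rev = ["x$i" for i in 9:-1:1]        # reversed feature order
df_rev = df[:, vcat(feat_rev, ["y"])]     # same data, permuted columns

config = EvoTreeRegressor(seed=123)
m_orig = fit_evotree(config, df; target_name="y", verbosity=0)
m_rev  = fit_evotree(config, df_rev; target_name="y", verbosity=0)

EvoTrees.importance(m_orig)
EvoTrees.importance(m_rev)   # differs from the above if the fit is order-sensitive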

@jeremiedb
Member

Thanks for bringing this up; your post presents a very interesting take on the impact of collinearity on feature influence within a model.

Something that first puzzled me in the post was that the feature importances changed when the seed was changed (or, equivalently, when fitting a model again from the same model configuration). Such changes would be expected in situations where stochasticity is involved, that is, with rowsample or colsample < 1. However, the defaults are set to 1 in EvoTrees, so there should be no impact.
It indeed turned out that sampling with replacement was used in the bin construction phase, which resulted in very small variations in the histogram cut points, but translated into meaningful changes in the final feature importances given the highly correlated data. I'll push a quick fix for this.
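
To illustrate the mechanism with a toy example (not the actual EvoTrees internals): sampling with replacement makes the bin edges seed-dependent, while a deterministic quantile rule does not.

using Random, Statistics

x = randn(Xoshiro(1), 10_000)
nbins = 32
probs = collect((1:nbins-1) ./ nbins)

# bootstrap-style edges: resample x with replacement, then take quantiles
edges_boot(seed) = quantile(x[rand(Xoshiro(seed), 1:length(x), length(x))], probs)

edges_boot(123) == edges_boot(124)        # false: cut points shift with the seed
quantile(x, probs) == quantile(x, probs)  # true: deterministic cut points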

The behavior following the fix (identical feature importance regardless of the seed):

 julia> config = EvoTreeRegressor(seed=123)
       m1 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);
       EvoTrees.importance(m1)
9-element Vector{Pair{String, Float64}}:
 "x7" => 0.2796793167722336
 "x4" => 0.1574339532233146
 "x9" => 0.13160160270079996
 "x1" => 0.09600142781806717
 "x6" => 0.08616217430299715
 "x5" => 0.08468852035644867
 "x8" => 0.08437252198421051
 "x2" => 0.04563882666799334
 "x3" => 0.03442165617393493

julia> config = EvoTreeRegressor(seed=124)
       m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);
       EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x7" => 0.2796793167722336
 "x4" => 0.1574339532233146
 "x9" => 0.13160160270079996
 "x1" => 0.09600142781806717
 "x6" => 0.08616217430299715
 "x5" => 0.08468852035644867
 "x8" => 0.08437252198421051
 "x2" => 0.04563882666799334
 "x3" => 0.03442165617393493

Note that the above deterministic behavior regardless of the seed only applies to "small" datasets, since the histogram construction uses at most 1000 * nbins observations. Hence, with 32 bins, we should start seeing the impact of different seeds on datasets larger than 32_000 observations.
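
A quick back-of-the-envelope check of that threshold (based only on the description above, not on the internals):

using DataFrames

nbins = 32
max_binning_obs = 1000 * nbins    # 32_000: at most this many observations are used to build the edges
nrow(df) <= max_binning_obs       # expected to be true for the blog-post data, so edge construction sees every row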

Following the above fix, a remaining puzzling behavior was, as you pointed out, that the feature importance changes following a permutation of the features. This was unexpected to me given the previous considerations on the default rowsample/colsample being set to 1.
It turned out that, given the (extremely) high correlations between features, some tree splits result in exactly the same gain for all 3 correlated features.

For example, see what happened at depth 3 for the original data and for the data with the feature order reversed (1:9 vs 9:-1:1):

# Original order: 
┌ Info: n: 6 | depth 3 | gains
│   findmax.((nodes[n]).gains) = 9-element Vector{Tuple{Float64, Int64}}:
│     (278.45118764246104, 18)
│     (278.45118764246104, 18)
│     (278.45118764246104, 18)
│     (185.46830866636523, 5)
│     (185.10073579939998, 5)
│     (185.46830866636523, 5)
│     (225.197300057252, 27)
│     (225.197300057252, 27)
└     (225.197300057252, 27)

# reverse feature order
┌ Info: n: 6 | depth 3 | gains
│   findmax.((nodes[n]).gains) = 9-element Vector{Tuple{Float64, Int64}}:
│     (225.197300057252, 27)
│     (225.197300057252, 27)
│     (225.197300057252, 27)
│     (185.46830866636523, 5)
│     (185.10073579939998, 5)
│     (185.46830866636523, 5)
│     (278.45118764246104, 18)
│     (278.45118764246104, 18)
└     (278.45118764246104, 18)

The consequence is that in the original order, feature "1" is kept, while in the reversed order, it's feature "7" (which corresponds to feature "3" in the original order). I think such behavior is reasonable, since in such a scenario it's simply choosing the first occurrence among 3 fully identical options. If using a weaker correlation factor δ such as 1e-3 (instead of 1e-6), this behavior disappears and identical models are returned following a permutation of the features. I could see an approach where the internal data is reordered based on a sort of the feature names, but I think this may introduce more ambiguity, especially considering that the "matrix"-based API is agnostic to feature names and a consistent column order is assumed.
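
That first-occurrence choice is consistent with the behavior of Julia's findmax (visible in the gains log above), which returns the index of the first maximal entry:

julia> gains = [278.45118764246104, 278.45118764246104, 278.45118764246104, 185.46830866636523]

julia> findmax(gains)
(278.45118764246104, 1)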

On a side note, I think I may have made sub-optimal default choices for the model configuration, setting the number of trees to 10 and the learning rate (eta) to 0.1. Such defaults were meant to be a minimal configuration resulting in a quick test run, not a meaningful default for real model construction, as I expect users to perform proper tuning of eta / nrounds using the early-stopping utility. I'll consider changing the defaults to a learning rate of 0.05 and nrounds of 100, which should result in more reasonable default models. Another aspect worth adding to the docs, I suppose!

Also, gradient boosted tree models are naturally pretty apt at handling situations with highly correlated features.
In a data setup like the one introduced in the blog post, using subsampling such as rowsample=0.5 and colsample=0.5 should provide significantly improved stability:

julia> config = EvoTreeRegressor(eta=0.05, nrounds=100, rowsample=0.5, colsample=0.5)
EvoTreeRegressor(
  nrounds = 100,
  lambda = 0.0, 
  gamma = 0.0,
  eta = 0.05,
  max_depth = 5,
  min_weight = 1.0,
  rowsample = 0.5,
  colsample = 0.5,
  nbins = 32,
  alpha = 0.5,
  monotone_constraints = Dict{Int64, Int64}(),
  tree_type = "binary",
  rng = TaskLocalRNG())

julia> m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);

julia> EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x8" => 0.19702609197406498
 "x7" => 0.16997926459545526
 "x4" => 0.13904160998082676
 "x5" => 0.12416677372573705
 "x1" => 0.1082816368577555
 "x9" => 0.09631462559769513
 "x6" => 0.06807269121181694
 "x2" => 0.06422132633095977
 "x3" => 0.03289597972568853

julia> m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);
       EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x8" => 0.20225689926677737
 "x7" => 0.19450162639346333
 "x5" => 0.12496839233185622
 "x4" => 0.1216585460049581
 "x1" => 0.09773613142312514
 "x2" => 0.08533914479794402
 "x6" => 0.08529129593026426
 "x9" => 0.0631971515339433
 "x3" => 0.025050812317668262

Such feature importance and effect attribution is particularly well behaved compared, for example, to what is observed with a linear model, where aliasing results in coefficient values that are all over the place!

julia> using GLM
       x_train = Matrix(mat[1:9, :]')
       y_train = mat[10, :]
       lm(x_train, y_train)
LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}:

Coefficients:
───────────────────────────────────────────────────────────────
       Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────
x1   3.94359     3.89817   1.01    0.3117   -3.69762   11.5848
x2  -9.41359     3.88274  -2.42    0.0153  -17.0245    -1.80263
x3   5.86258     3.94299   1.49    0.1371   -1.86648   13.5916
x4   4.2         3.86762   1.09    0.2775   -3.38132   11.7813
x5  -2.45847     3.88596  -0.63    0.5270  -10.0757     5.15879
x6  -1.24553     3.92234  -0.32    0.7508   -8.93411    6.44305
x7   1.55798     3.89015   0.40    0.6888   -6.0675     9.18347
x8  -3.32843     3.92553  -0.85    0.3965  -11.0233     4.3664
x9   2.36631     3.89272   0.61    0.5433   -5.2642     9.99683

@bkamins
Author

bkamins commented Aug 12, 2023

I agree I have chosen a corner case (on purpose - to show readers the general issue, and also because I wanted to promote EvoTrees.jl, as I think it is excellent). Therefore feel free to proceed the way you prefer with the issues I have opened. Some detailed comments are below.

sampling with replacement was used in the bin construction phase

This is what I assumed.

Note that the above deterministic behavior regardless of the seed only applies to "small" datasets, since the histogram construction uses at most 1000 * nbins observations. Hence, with 32 bins, we should start seeing the impact of different seeds on datasets larger than 32_000 observations.

I think it would be good to document this. In particular, to document that, except for this binning case, the default settings are deterministic.

some tree splits result in exactly the same gain for all 3 correlated features.

This was my design intention 😄. I think keeping it the way you do it is OK (assuming you document that the algorithm is deterministic, as discussed above).

I think I may have made sub-optimal default choices for the model configuration

Indeed, this is what puzzled me, as xgboost uses different defaults.

In a data setup like the one introduced in the blog post, using subsampling such as rowsample=0.5 and colsample=0.5 should provide significantly improved stability:

Yes (and I tested this; colsample in particular impacts this), but doing what I did in the post fit the point of my example better. In particular, there are often cases in practice where you might get even 100% correlated features in a data set (e.g. two columns that are linearly dependent - I have often seen such data in practice). So my test is not just a purely artificial exercise.
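
For example (a made-up illustration, not the blog-post data), an exactly linearly dependent column could look like:

using DataFrames

df2 = copy(df)
df2.x10 = 2.0 .* df2.x1 .+ 1.0   # x10 is an exact linear transform of x1, i.e. 100% correlated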

@bkamins
Author

bkamins commented Aug 12, 2023

In summary - this issue can be closed given the discussion that you want deterministic behavior (but #245 and #247 are things that I think are more important to consider).

(or maybe you want to keep it open just to keep track of the discussion before you push a fix?)

@jeremiedb
Member

jeremiedb commented Aug 12, 2023

Thanks for the clarifications. I've just pushed a PR that fixes the sampling for edge determination, as well as an update to the default parameters: #250

I'll keep this issue open until I've added some meat to the docs regarding the discussed clarifications on stochasticity in the model, as well as some comments on the default model parameters.

And thanks for showcasing EvoTrees in your blog :)

jeremiedb mentioned this issue Aug 18, 2023
@jeremiedb
Member

Addressed by #252
