# Shuffling feature order when changing seed #246
Thanks for bringing this up. Your post presents a very interesting take on the impact of collinearity on feature influence within a model. Something that first puzzled me in the post was that the feature importance changed if the seed was changed (or, equivalently, when fitting a model again from the same model configuration). Such changes would be expected in situations where stochasticity is involved, that is, with `rowsample` or `colsample` below 1. The behavior following the fix (identical feature importance regardless of the seed):

```julia
julia> config = EvoTreeRegressor(seed=123)

julia> m1 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);

julia> EvoTrees.importance(m1)
9-element Vector{Pair{String, Float64}}:
 "x7" => 0.2796793167722336
 "x4" => 0.1574339532233146
 "x9" => 0.13160160270079996
 "x1" => 0.09600142781806717
 "x6" => 0.08616217430299715
 "x5" => 0.08468852035644867
 "x8" => 0.08437252198421051
 "x2" => 0.04563882666799334
 "x3" => 0.03442165617393493
```
```julia
julia> config = EvoTreeRegressor(seed=124)

julia> m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);

julia> EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x7" => 0.2796793167722336
 "x4" => 0.1574339532233146
 "x9" => 0.13160160270079996
 "x1" => 0.09600142781806717
 "x6" => 0.08616217430299715
 "x5" => 0.08468852035644867
 "x8" => 0.08437252198421051
 "x2" => 0.04563882666799334
 "x3" => 0.03442165617393493
```

Note that the above deterministic behavior regardless of the seed only applies to "small" datasets, since the histogram building uses at most `1000 * nbins` observations. Hence, with 32 bins, we should start seeing the impact of different seeds on datasets larger than 32_000 observations.

Following the above fix, a remaining puzzling behavior was, as you pointed out, that the feature importance changes following a permutation of the features. This was unexpected to me given the previous considerations on the default `rowsample`/`colsample` of 1. For example, see what happened at depth 3 for the original data and for the data with the features in reverse order (`1:9` vs `9:-1:1`):

```julia
# Original order:
┌ Info: n: 6 | depth 3 | gains
│ findmax.((nodes[n]).gains) =
│ 9-element Vector{Tuple{Float64, Int64}}:
│ (278.45118764246104, 18)
│ (278.45118764246104, 18)
│ (278.45118764246104, 18)
│ (185.46830866636523, 5)
│ (185.10073579939998, 5)
│ (185.46830866636523, 5)
│ (225.197300057252, 27)
│ (225.197300057252, 27)
└ (225.197300057252, 27)
```
```julia
# Reverse feature order:
┌ Info: n: 6 | depth 3 | gains
│ findmax.((nodes[n]).gains) =
│ 9-element Vector{Tuple{Float64, Int64}}:
│ (225.197300057252, 27)
│ (225.197300057252, 27)
│ (225.197300057252, 27)
│ (185.46830866636523, 5)
│ (185.10073579939998, 5)
│ (185.46830866636523, 5)
│ (278.45118764246104, 18)
│ (278.45118764246104, 18)
└ (278.45118764246104, 18)
```

The consequence is that in the original order, feature "1" is kept, while in the reversed order it's feature "7" (which corresponds to feature "3" in the original order). I think such behavior is reasonable, since in such a scenario it's simply choosing the first occurrence among 3 fully identical options. If using a weaker sampling of the data (`rowsample`/`colsample` below 1), such ties get broken stochastically.

On a side note, I think I may have made sub-optimal default choices for the model configuration, setting the number of trees to 10 and the learning rate (`eta`) to 0.1. Such defaults were meant as a minimal configuration resulting in quick test runs, not as meaningful defaults for real model construction, since I expect users to perform a proper tuning of `eta`/`nrounds` using the early-stopping utility. I'll consider changing the defaults to a learning rate of 0.05 and `nrounds` of 100, which should result in more reasonable default models. Another aspect worth adding to the docs, I suppose!

Also, gradient boosted tree models are naturally pretty apt at handling situations with highly correlated features:

```julia
julia> config = EvoTreeRegressor(eta=0.05, nrounds=100, rowsample=0.5, colsample=0.5)
EvoTreeRegressor(
  nrounds = 100,
  lambda = 0.0,
  gamma = 0.0,
  eta = 0.05,
  max_depth = 5,
  min_weight = 1.0,
  rowsample = 0.5,
  colsample = 0.5,
  nbins = 32,
  alpha = 0.5,
  monotone_constraints = Dict{Int64, Int64}(),
  tree_type = "binary",
  rng = TaskLocalRNG())
```
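The first-occurrence tie-breaking described above can be seen in isolation with Base's `findmax`, which returns the first index attaining the maximum. A minimal sketch (the gain values below are illustrative stand-ins mimicking the logs above, not actual model internals):

```julia
# Hypothetical per-feature best split gains: features 1-3 are perfect
# copies of each other, as are features 7-9.
gains = [278.45, 278.45, 278.45, 185.47, 185.10, 185.47, 225.20, 225.20, 225.20]

# findmax returns the FIRST index attaining the maximum, so among the
# tied features 1, 2, 3 the split goes to feature 1:
findmax(gains)                           # (278.45, 1)

# After reversing the feature order, the first member of the tied trio
# sits at position 7, i.e. feature 3 in the original ordering:
rev = findmax(reverse(gains))            # (278.45, 7)
orig_feat = length(gains) - rev[2] + 1   # 3
```

This reproduces the observation above: the original order keeps feature "1", while the reversed order keeps the feature that was "3" originally.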
```julia
julia> m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);

julia> EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x8" => 0.19702609197406498
 "x7" => 0.16997926459545526
 "x4" => 0.13904160998082676
 "x5" => 0.12416677372573705
 "x1" => 0.1082816368577555
 "x9" => 0.09631462559769513
 "x6" => 0.06807269121181694
 "x2" => 0.06422132633095977
 "x3" => 0.03289597972568853

julia> m2 = fit_evotree(config,
           df;
           target_name="y",
           verbosity=0);

julia> EvoTrees.importance(m2)
9-element Vector{Pair{String, Float64}}:
 "x8" => 0.20225689926677737
 "x7" => 0.19450162639346333
 "x5" => 0.12496839233185622
 "x4" => 0.1216585460049581
 "x1" => 0.09773613142312514
 "x2" => 0.08533914479794402
 "x6" => 0.08529129593026426
 "x9" => 0.0631971515339433
 "x3" => 0.025050812317668262
```

Such feature importance and effect attribution is particularly well-behaved compared, for example, to what is observed with a linear model, where model aliasing results in factor values all over the place!

```julia
julia> using GLM

julia> x_train = Matrix(mat[1:9, :]')

julia> y_train = mat[10, :]

julia> lm(x_train, y_train)
LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}:

Coefficients:
────────────────────────────────────────────────────────────────
       Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────
x1   3.94359     3.89817   1.01    0.3117   -3.69762   11.5848
x2  -9.41359     3.88274  -2.42    0.0153  -17.0245    -1.80263
x3   5.86258     3.94299   1.49    0.1371   -1.86648   13.5916
x4   4.2         3.86762   1.09    0.2775   -3.38132   11.7813
x5  -2.45847     3.88596  -0.63    0.5270  -10.0757     5.15879
x6  -1.24553     3.92234  -0.32    0.7508   -8.93411    6.44305
x7   1.55798     3.89015   0.40    0.6888   -6.0675     9.18347
x8  -3.32843     3.92553  -0.85    0.3965  -11.0233     4.3664
x9   2.36631     3.89272   0.61    0.5433   -5.2642     9.99683
────────────────────────────────────────────────────────────────
```
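To make the aliasing point concrete, here is a hypothetical extreme case (exactly duplicated columns on synthetic data, not the blog's dataset): ordinary least squares then has no unique solution, and a tiny ridge penalty shows the effect being split between the aliased features.

```julia
using LinearAlgebra

# Hypothetical extreme of aliasing: x2 is an exact copy of x1, so the
# design matrix is rank-deficient.
x1 = collect(1.0:10.0)
X  = [x1 x1]
y  = 3 .* x1   # the whole effect (3.0) lives on the shared direction

rank(X)        # 1: X'X is singular, so OLS has infinitely many solutions;
               # any (b1, b2) with b1 + b2 = 3 fits perfectly, which is why
               # coefficients can land "all over the place" depending on the solver.

# A tiny ridge penalty selects the symmetric solution, splitting the
# effect evenly between the two aliased columns:
λ = 1e-6
β = (X'X + λ * I) \ (X'y)   # ≈ [1.5, 1.5]
```

With imperfect (rather than exact) correlation the OLS solution becomes unique but ill-conditioned, which shows up as the inflated standard errors and unstable signs in the GLM fit above.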
I agree I have chosen a corner case (on purpose, to show readers the general issue, and because I wanted to promote EvoTrees.jl, which I think is excellent). Therefore feel free to proceed the way you prefer with the issues I have opened. Some detailed comments are below.
This is what I assumed.
I think it would be good to document this. In particular, to document that, apart from this binning case, the default settings are deterministic.
This was my design intention 😄. I think keeping it the way you do it is OK (assuming you document that the algorithm is deterministic as discussed above)
Indeed, this is what puzzled me, as xgboost uses different defaults.
Yes (and I tested this; especially colsample impacts this), but doing what I did in the post fit the point of my example better. In particular, there are often cases in practice when you might get even 100% correlated features in a data set (e.g. two values that are linearly dependent; I have often seen such data in practice). So my test is not just a purely artificial exercise.
Thanks for the clarifications. I've just pushed a PR that fixes the sampling for edges determination, as well as an update to the default parameters: #250. I'll keep this issue open until I've added some meat to the docs regarding the discussed clarifications about stochasticity in the model, as well as some comments on the default model parameters. And thanks for showcasing EvoTrees in your blog :)
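For the docs, the `1000 * nbins` cutoff discussed above could be illustrated with a sketch along these lines. The `edge_sample` helper below is hypothetical (not the actual EvoTrees implementation): it uses all rows when the dataset holds at most `1000 * nbins` observations, and a seeded subsample beyond that, which is where the seed starts to matter.

```julia
using Random

# Hypothetical helper (not the actual EvoTrees code): pick the rows used
# to determine histogram bin edges for one feature column.
function edge_sample(x::AbstractVector, nbins::Integer; rng=Random.default_rng())
    cap = 1000 * nbins
    length(x) <= cap && return x               # small data: all rows, seed-independent
    return x[randperm(rng, length(x))[1:cap]]  # large data: seeded subsample
end

# Below the cap, the full column is returned regardless of the seed:
small = collect(1.0:2000)
edge_sample(small, 32; rng=MersenneTwister(1)) == small   # true (2000 <= 32_000)

# Above the cap (nbins=1 gives cap=1000), different seeds pick different rows:
big = collect(1.0:5000)
s1 = edge_sample(big, 1; rng=MersenneTwister(1))
s2 = edge_sample(big, 1; rng=MersenneTwister(2))
s1 == edge_sample(big, 1; rng=MersenneTwister(1))   # true (same seed, same rows)
s1 == s2                                            # false (different seeds)
```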
Addressed by #252
Another issue following https://bkamins.github.io/julialang/2023/08/11/evotrees.html.
I have not checked your code in detail here, but it seems that when the seed of the rng is changed, you do not shuffle the order of features in training, and that the way the algorithm is implemented makes it sensitive to this order.
This is a minor issue, but in some cases (such as variable importance) it might matter.
(but I might be wrong here as I have not studied your source code carefully enough)