BoundsError on split_set_threads! #201

Closed
olivierlabayle opened this issue Dec 23, 2022 · 11 comments

@olivierlabayle

Hi,

I think I am facing an edge case where the tree split results in a BoundsError. It has been quite tedious to come up with a reproducible example, and the one below is not ideal since it originates from a large dataset (which I can probably share if needed). Due to the asynchronous fitting strategy of MLJ this is also hard to debug (I can't step into the call). The line that throws the error is this one. Do you see any reason why this could result in a BoundsError? I should also say that the error is stochastic: changing to rng = StableRNG(1234), for instance, does not raise it.

The code and stacktrace are below (but you won't be able to reproduce without the dataset):

code:

using CSV, MLJ, DataFrames, MLJBase, EvoTrees
using StableRNGs

rng = StableRNG(123)

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

evotree = EvoTreeClassifier(rng=rng)
ranges = [
    range(evotree, :max_depth, lower=5, upper=7), 
    range(evotree, :lambda, lower=1e-5, upper=10, scale=:log)
]
tuned_evotree = TunedModel(
    model=evotree,
    resampling=Holdout(shuffle=false, rng=rng),
    tuning=Grid(goal=10, rng=rng),
    range=ranges,
    measure=log_loss
)

MLJBase.fit(tuned_evotree, 1, X, y)

stacktrace:

ERROR: BoundsError: attempt to access 335997-element Vector{UInt32} at index [335998:336120]
Stacktrace:
  [1] throw_boundserror(A::Vector{UInt32}, I::Tuple{UnitRange{Int64}})
    @ Base ./abstractarray.jl:703
  [2] checkbounds
    @ ./abstractarray.jl:668 [inlined]
  [3] view
    @ ./subarray.jl:177 [inlined]
  [4] split_set_threads!(out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, is::SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}, x_bin::Matrix{UInt8}, feat::Int64, cond_bin::UInt8, offset::Int64)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/find_split.jl:147
  [5] grow_tree!(tree::EvoTrees.Tree{EvoTrees.Softmax, 2, Float32}, nodes::Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32}, ∇::Matrix{Float32}, edges::Vector{Vector{Float32}}, js::Vector{UInt32}, out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, x_bin::Matrix{UInt8}, monotone_constraints::Vector{Int32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:229
  [6] grow_evotree!(evotree::EvoTree{EvoTrees.Softmax, 2, Float32}, cache::NamedTuple{(:info, :x, :y, :w, :K, :nodes, :pred, :is_in, :is_out, :mask, :js_, :js, :out, :left, :right, :∇, :edges, :x_bin, :monotone_constraints), Tuple{Dict{Symbol, Int64}, Matrix{Float32}, Vector{UInt32}, Vector{Float32}, Int64, Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, Matrix{Float32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt8}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Matrix{Float32}, Vector{Vector{Float32}}, Matrix{UInt8}, Vector{Int32}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:142
  [7] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false}, w::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:9
  [8] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:2
  [9] fit_only!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [10] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [11] fit_and_extract_on_fold
    @ ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1180 [inlined]
 [12] (::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64})(k::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1019
 [13] mapreduce_first
    @ ./reduce.jl:419 [inlined]
 [14] _mapreduce(f::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64}, op::typeof(vcat), #unused#::IndexLinear, A::UnitRange{Int64})
    @ Base ./reduce.jl:430
 [15] _mapreduce_dim
    @ ./reducedim.jl:365 [inlined]
 [16] #mapreduce#765
    @ ./reducedim.jl:357 [inlined]
 [17] mapreduce
    @ ./reducedim.jl:357 [inlined]
 [18] _evaluate!(func::MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, #unused#::CPU1{Nothing}, nfolds::Int64, verbosity::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1018
 [19] evaluate!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1221
 [20] evaluate!(::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1292
 [21] fit(::Resampler{Holdout}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1448
 [22] fit_only!(mach::Machine{Resampler{Holdout}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [23] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [24] event!(metamodel::EvoTreeClassifier{EvoTrees.Softmax, Float32}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:436
 [25] #35
    @ ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:474 [inlined]
 [26] iterate
    @ ./generator.jl:47 [inlined]
 [27] _collect(c::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, itr::Base.Generator{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, MLJTuning.var"#35#36"{Machine{Resampler{Holdout}, false}, Int64, Grid, Nothing, NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, ProgressMeter.Progress}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:807
 [28] collect_similar
    @ ./array.jl:716 [inlined]
 [29] map
    @ ./abstractarray.jl:2933 [inlined]
 [30] assemble_events!(metamodels::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, acceleration::CPU1{Nothing})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:473
 [31] build!(history::Nothing, n::Int64, tuning::Grid, model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, model_buffer::Channel{Any}, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, verbosity::Int64, acceleration::CPU1{Nothing}, resampling_machine::Machine{Resampler{Holdout}, false})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:667
 [32] fit(::MLJTuning.ProbabilisticTunedModel{Grid, EvoTreeClassifier{EvoTrees.Softmax, Float32}}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:747
 [33] top-level scope
    @ ~/Dev/TARGENE/TargetedEstimation/sandbox.jl:23
@olivierlabayle
Author

I've managed to reduce the example to the following; again, let me know how best to share the dataset if you want to reproduce:

using CSV, DataFrames, MLJBase, EvoTrees
using StableRNGs

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

train, test = MLJBase.train_test_pairs(Holdout(), 1:size(X, 1), X, y)[1]
rng = StableRNG(1)
model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
Xtrain, ytrain = MLJBase.reformat(model, selectrows(X, train), selectrows(y, train))
MLJBase.fit(model, 1, Xtrain, ytrain)

The issue arises because offset + length(is) at line 152 is larger than the size of out.
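For illustration, here is a minimal sketch of the failing access pattern, with sizes reconstructed from the BoundsError message above (the offset value is inferred from the reported index range, not taken from the actual run):

```julia
# Minimal sketch of the failing pattern in split_set_threads!
# (sizes reconstructed from the BoundsError message above).
out = Vector{UInt32}(undef, 335_997)   # the 335997-element vector from the error
offset = 335_997                       # hypothetical offset for this node
is = zeros(UInt32, 123)                # offset + length(is) = 336_120 > length(out)
view(out, offset+1:offset+length(is))  # throws BoundsError at index [335998:336120]
```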

@jeremiedb
Member

Thanks for raising this! Could you confirm the EvoTrees version you're using?
I suspect the bug is tied to the new row-sampling approach introduced in v0.14, but if it also occurs on v0.13, that would change the diagnosis.
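(One way to confirm the installed version, from the active environment:)

```julia
# Print the resolved EvoTrees version in the active environment
using Pkg
Pkg.status("EvoTrees")
```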

@olivierlabayle
Author

Yes, this is with version v0.14.2. Out of curiosity, I tried v0.13 with 100 different random seeds and couldn't reproduce the bug, so you are probably right!
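For reference, a seed sweep of the kind described could look like this, reusing Xtrain/ytrain from the reduced example above (the loop body and error handling are a hypothetical sketch, not the exact script used):

```julia
# Hypothetical seed sweep: refit under many RNG seeds and record BoundsErrors.
using StableRNGs
for seed in 1:100
    rng = StableRNG(seed)
    model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
    try
        MLJBase.fit(model, 0, Xtrain, ytrain)
    catch e
        e isa BoundsError || rethrow()
        @warn "BoundsError triggered" seed
    end
end
```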

@jeremiedb
Member

I tried various runs, including with StableRNG and various seeds, but I couldn't get any that generated a failure.
So if you're willing to share the data, that would be most helpful (jeremie.desgagne.bouchard @ gmail.com)

@jeremiedb mentioned this issue Dec 25, 2022
@jeremiedb
Member

@olivierlabayle Could you test the current main branch?
I've pushed a fix that seems to resolve the bug you encountered.
I'll still need time to understand the root cause of the spurious bugs in the previous implementation, but the current fix looks robust in all tests performed.

@olivierlabayle
Author

Thanks, I confirm I can't reproduce the bug on this dataset with main.

@olivierlabayle
Author

Did you manage to find the origin of the problem?

@jeremiedb
Member

Not yet! It's actually quite puzzling, as I have failed to reduce the issue to a simpler reproducible problem. I'm afraid it's unlikely someone will be willing to investigate based on a full EvoTrees training that just hits an issue at the 5th iteration.

That being said, I think there are some relevant cues to help continue the investigation. Notably, the bug appears to creep in within the update_gains! function.

  • If running experiments/debug-softmax-split-cpu up to

    @time m_evo = fit_evotree(params_evo; x_train, y_train, x_eval = x_train, y_eval = y_train, metric=:mlogloss, print_every_n = 1);

    the following lines

    @info "minimum(hR[3,:,:])" minimum(hR[3, :, :])
    @info "minimum(hR2[3,:,:])" minimum(hR2[3, :, :])

    print the original (faulty) and new (cumsum) min values for the weights in each bin. Both values are the same, as expected, throughout the first iteration, but start to diverge on the second iteration. This may indicate that something should be initialized differently between iterations, but I couldn't identify anything problematic (see the divergence-finder sketch after this list).

  • On the GPU side, the inclusion of the following else condition results in a failure to run the kernel:

    else
        gains[bin, j] = 0

    This condition isn't necessary for the algorithm, but the fact that it fails may be symptomatic of the issue that also affects the CPU side.

My next step would be to try to reproduce the GPU-side failure in a MWE, so that I can submit a relevant issue upstream.
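A possible skeleton for such an MWE, mirroring only the shape of the reported failure (the real gains kernel is more involved; the condition mask and launch configuration here are assumptions):

```julia
using CUDA

# Toy kernel mirroring the reported else branch: zero out gains for bins
# that don't satisfy the split condition.
function gains_else_kernel!(gains, cond)
    bin = threadIdx().x
    j = blockIdx().x
    if cond[bin, j]
        gains[bin, j] = 1.0f0
    else
        gains[bin, j] = 0.0f0  # the branch whose inclusion reportedly breaks the real kernel
    end
    return nothing
end

gains = CUDA.zeros(Float32, 32, 8)
cond = CuArray(rand(Bool, 32, 8))  # hypothetical condition mask
@cuda threads=32 blocks=8 gains_else_kernel!(gains, cond)
```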

Let me know if there's something you'd like to investigate on your end.

@olivierlabayle
Author

Thank you for the feedback! I will try to investigate the original issue further, since I think I've just managed to trigger the same error on a similar dataset with v0.13.1. Since I don't know the internals, it might take me some time though.

@jeremiedb
Member

Have you encountered any new issues? With v0.15.0, I've paid closer attention to numerical instabilities and found the new release to be reliable under all tested scenarios. Given the significant revamp, I would therefore close this unless there are still scenarios leading to crashes.

@olivierlabayle
Author

Sorry, it was just faster to move to XGBoost.jl. I think you can close this, and I'll try again later when I have time.
