BoundsError on split_set_threads! #201

Closed
olivierlabayle opened this issue Dec 23, 2022 · 11 comments

@olivierlabayle

Hi,

I think I am facing an edge case where the tree split results in a BoundsError. It has been quite tedious to come up with a reproducible example, and the one below is not ideal since it originates from a large dataset (which I can probably share if needed). Due to the asynchronous fitting strategy of MLJ this is also hard to debug (I can't step into the call). The line that throws the error is this one. Do you see any reason why this could result in a BoundsError? I should also say that the error is stochastic: changing to rng = StableRNG(1234), for instance, does not raise it.

The code and stacktrace are below (but you won't be able to reproduce without the dataset):

code:

using CSV, MLJ, DataFrames, MLJBase, EvoTrees
using StableRNGs

rng = StableRNG(123)

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

evotree = EvoTreeClassifier(rng=rng)
ranges = [
    range(evotree, :max_depth, lower=5, upper=7), 
    range(evotree, :lambda, lower=1e-5, upper=10, scale=:log)
]
tuned_evotree = TunedModel(
    model=evotree,
    resampling=Holdout(shuffle=false, rng=rng),
    tuning=Grid(goal=10, rng=rng),
    range=ranges,
    measure=log_loss
)

MLJBase.fit(tuned_evotree, 1, X, y)

stacktrace:

ERROR: BoundsError: attempt to access 335997-element Vector{UInt32} at index [335998:336120]
Stacktrace:
  [1] throw_boundserror(A::Vector{UInt32}, I::Tuple{UnitRange{Int64}})
    @ Base ./abstractarray.jl:703
  [2] checkbounds
    @ ./abstractarray.jl:668 [inlined]
  [3] view
    @ ./subarray.jl:177 [inlined]
  [4] split_set_threads!(out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, is::SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}, x_bin::Matrix{UInt8}, feat::Int64, cond_bin::UInt8, offset::Int64)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/find_split.jl:147
  [5] grow_tree!(tree::EvoTrees.Tree{EvoTrees.Softmax, 2, Float32}, nodes::Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32}, ∇::Matrix{Float32}, edges::Vector{Vector{Float32}}, js::Vector{UInt32}, out::Vector{UInt32}, left::Vector{UInt32}, right::Vector{UInt32}, x_bin::Matrix{UInt8}, monotone_constraints::Vector{Int32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:229
  [6] grow_evotree!(evotree::EvoTree{EvoTrees.Softmax, 2, Float32}, cache::NamedTuple{(:info, :x, :y, :w, :K, :nodes, :pred, :is_in, :is_out, :mask, :js_, :js, :out, :left, :right, :∇, :edges, :x_bin, :monotone_constraints), Tuple{Dict{Symbol, Int64}, Matrix{Float32}, Vector{UInt32}, Vector{Float32}, Int64, Vector{EvoTrees.TrainNode{Float32, SubArray{UInt32, 1, Vector{UInt32}, Tuple{UnitRange{Int64}}, true}}}, Matrix{Float32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt8}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Vector{UInt32}, Matrix{Float32}, Vector{Vector{Float32}}, Matrix{UInt8}, Vector{Int32}}}, params::EvoTreeClassifier{EvoTrees.Softmax, Float32})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/fit.jl:142
  [7] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false}, w::Nothing)
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:9
  [8] fit(model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, verbosity::Int64, A::NamedTuple{(:matrix, :names), Tuple{SubArray{Float64, 2, Matrix{Float64}, Tuple{Vector{Int64}, Base.Slice{Base.OneTo{Int64}}}, false}, Vector{Symbol}}}, y::SubArray{CategoricalArrays.CategoricalValue{Bool, UInt32}, 1, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, Tuple{Vector{Int64}}, false})
    @ EvoTrees ~/.julia/packages/EvoTrees/ayRL8/src/MLJ.jl:2
  [9] fit_only!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}; rows::Vector{Int64}, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [10] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [11] fit_and_extract_on_fold
    @ ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1180 [inlined]
 [12] (::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64})(k::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1019
 [13] mapreduce_first
    @ ./reduce.jl:419 [inlined]
 [14] _mapreduce(f::MLJBase.var"#307#308"{MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, Int64}, op::typeof(vcat), #unused#::IndexLinear, A::UnitRange{Int64})
    @ Base ./reduce.jl:430
 [15] _mapreduce_dim
    @ ./reducedim.jl:365 [inlined]
 [16] #mapreduce#765
    @ ./reducedim.jl:357 [inlined]
 [17] mapreduce
    @ ./reducedim.jl:357 [inlined]
 [18] _evaluate!(func::MLJBase.var"#fit_and_extract_on_fold#330"{Vector{Tuple{Vector{Int64}, Vector{Int64}}}, Nothing, Nothing, Int64, Vector{LogLoss{Float64}}, Vector{typeof(predict)}, Bool, Bool, CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}}, DataFrame}, mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, #unused#::CPU1{Nothing}, nfolds::Int64, verbosity::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1018
 [19] evaluate!(mach::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1221
 [20] evaluate!(::Machine{EvoTreeClassifier{EvoTrees.Softmax, Float32}, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1292
 [21] fit(::Resampler{Holdout}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/resampling.jl:1448
 [22] fit_only!(mach::Machine{Resampler{Holdout}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:680
 [23] #fit!#63
    @ ~/.julia/packages/MLJBase/9Nkjh/src/machines.jl:778 [inlined]
 [24] event!(metamodel::EvoTreeClassifier{EvoTrees.Softmax, Float32}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:436
 [25] #35
    @ ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:474 [inlined]
 [26] iterate
    @ ./generator.jl:47 [inlined]
 [27] _collect(c::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, itr::Base.Generator{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, MLJTuning.var"#35#36"{Machine{Resampler{Holdout}, false}, Int64, Grid, Nothing, NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, ProgressMeter.Progress}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:807
 [28] collect_similar
    @ ./array.jl:716 [inlined]
 [29] map
    @ ./abstractarray.jl:2933 [inlined]
 [30] assemble_events!(metamodels::Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, resampling_machine::Machine{Resampler{Holdout}, false}, verbosity::Int64, tuning::Grid, history::Nothing, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, acceleration::CPU1{Nothing})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:473
 [31] build!(history::Nothing, n::Int64, tuning::Grid, model::EvoTreeClassifier{EvoTrees.Softmax, Float32}, model_buffer::Channel{Any}, state::NamedTuple{(:models, :fields, :parameter_scales, :models_delivered), Tuple{Vector{EvoTreeClassifier{EvoTrees.Softmax, Float32}}, Vector{Symbol}, Vector{Symbol}, Bool}}, verbosity::Int64, acceleration::CPU1{Nothing}, resampling_machine::Machine{Resampler{Holdout}, false})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:667
 [32] fit(::MLJTuning.ProbabilisticTunedModel{Grid, EvoTreeClassifier{EvoTrees.Softmax, Float32}}, ::Int64, ::DataFrame, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
    @ MLJTuning ~/.julia/packages/MLJTuning/ZFg3R/src/tuned_models.jl:747
 [33] top-level scope
    @ ~/Dev/TARGENE/TargetedEstimation/sandbox.jl:23
@olivierlabayle
Author

I've managed to reduce the example to the following; again, let me know how best to share the dataset if you want to reproduce:

using CSV, DataFrames, MLJBase, EvoTrees
using StableRNGs

data = CSV.read("/Users/olivierlabayle/Downloads/pb_data.csv", DataFrame)
y = categorical(data.target)
X = data[!, Not(:target)]

train, test = MLJBase.train_test_pairs(Holdout(), 1:size(X, 1), X, y)[1]
rng = StableRNG(1)
model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
Xtrain, ytrain = MLJBase.reformat(model, selectrows(X, train), selectrows(y, train))
MLJBase.fit(model, 1, Xtrain, ytrain)

The issue arises because offset + length(is) at line 152 is larger than the size of out.
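For illustration, here is a minimal sketch of the failing access pattern, with sizes reconstructed from the BoundsError message above (the offset value is inferred from the reported index range, not taken from the actual run):

```julia
# Minimal sketch of the failing pattern in split_set_threads!
# (sizes reconstructed from the BoundsError message above).
out = Vector{UInt32}(undef, 335_997)   # the 335997-element vector from the error
offset = 335_997                       # hypothetical offset for this node
is = zeros(UInt32, 123)                # offset + length(is) = 336_120 > length(out)
view(out, offset+1:offset+length(is))  # throws BoundsError at index [335998:336120]
```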

@jeremiedb
Member

Thanks for raising this! Could you confirm the EvoTrees version you're using?
I suspect the bug is tied to the new row-sampling approach introduced in v0.14, but if it also occurs on v0.13, that would change the diagnosis.
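(One way to confirm the installed version, from the active environment:)

```julia
# Print the resolved EvoTrees version in the active environment
using Pkg
Pkg.status("EvoTrees")
```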

@olivierlabayle
Author

Yes, this is with version v0.14.2. Out of curiosity, I tried v0.13 with 100 different random seeds and couldn't reproduce the bug, so you are probably right!
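For reference, a seed sweep of the kind described could look like this, reusing Xtrain/ytrain from the reduced example above (the loop body and error handling are a hypothetical sketch, not the exact script used):

```julia
# Hypothetical seed sweep: refit under many RNG seeds and record BoundsErrors.
using StableRNGs
for seed in 1:100
    rng = StableRNG(seed)
    model = EvoTreeClassifier(nrounds=100, lambda=1e-5, max_depth=7, rng=rng)
    try
        MLJBase.fit(model, 0, Xtrain, ytrain)
    catch e
        e isa BoundsError || rethrow()
        @warn "BoundsError triggered" seed
    end
end
```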

@jeremiedb
Member

I tried various runs, including with StableRNG and various seeds, but I couldn't get any that generated a failure.
So if you're willing to share the data, that would be most helpful (jeremie.desgagne.bouchard @ gmail.com)

@jeremiedb mentioned this issue Dec 25, 2022
@jeremiedb
Member

@olivierlabayle Could you test the current main branch?
I've pushed a fix that seems to resolve the bug you encountered.
I'll still need time to understand the root cause of the spurious bugs in the previous implementation, but the current fix looks robust in all tests performed.

@olivierlabayle
Author

Thanks, I confirm I can't reproduce the bug on this dataset with main.

@olivierlabayle
Author

Did you manage to find the origin of the problem?

@jeremiedb
Member

Not yet! It's actually quite puzzling, as I have failed to reduce the issue to a simpler reproducible problem. I'm afraid it's unlikely someone will be willing to investigate based on a full EvoTrees training that just hits an issue at the 5th iteration.

That being said, I think there are some relevant cues to help continue the investigation. Notably, the bug appears to creep in within the update_gains! function.

  • If running experiments/debug-softmax-split-cpu up to

    @time m_evo = fit_evotree(params_evo; x_train, y_train, x_eval = x_train, y_eval = y_train, metric=:mlogloss, print_every_n = 1);

    the following lines

    @info "minimum(hR[3,:,:])" minimum(hR[3, :, :])
    @info "minimum(hR2[3,:,:])" minimum(hR2[3, :, :])

    print the original (faulty) and new (cumsum) min values for the weights in each bin. Both values are the same, as expected, throughout the first iteration, but start to diverge on the second iteration. This may indicate that something should be initialized differently between iterations, but I couldn't identify anything problematic (see the divergence-finder sketch after this list).

  • On the GPU side, the inclusion of the following else condition results in a failure to run the kernel:

    else
        gains[bin, j] = 0

    This condition isn't necessary for the algorithm, but the fact that it fails may be symptomatic of the issue that also affects the CPU side.

My next step would be to try to reproduce the GPU-side failure in a MWE, so that I can submit a relevant issue upstream.
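A possible skeleton for such an MWE, mirroring only the shape of the reported failure (the real gains kernel is more involved; the condition mask and launch configuration here are assumptions):

```julia
using CUDA

# Toy kernel mirroring the reported else branch: zero out gains for bins
# that don't satisfy the split condition.
function gains_else_kernel!(gains, cond)
    bin = threadIdx().x
    j = blockIdx().x
    if cond[bin, j]
        gains[bin, j] = 1.0f0
    else
        gains[bin, j] = 0.0f0  # the branch whose inclusion reportedly breaks the real kernel
    end
    return nothing
end

gains = CUDA.zeros(Float32, 32, 8)
cond = CuArray(rand(Bool, 32, 8))  # hypothetical condition mask
@cuda threads=32 blocks=8 gains_else_kernel!(gains, cond)
```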

Let me know if there's something you'd like to investigate on your end.

@olivierlabayle
Author

Thank you for the feedback! I will try to investigate the original issue further, since I think I've just managed to trigger the same error on a similar dataset with v0.13.1. Since I don't know the internals, it might take me some time though.

@jeremiedb
Member

Have you encountered any new issues? With v0.15.0, I've paid closer attention to numerical instabilities and found the new release to be reliable under all tested scenarios. Given the significant revamp, I would therefore close this unless there are still scenarios leading to crashes.

@olivierlabayle
Author

Sorry, it was just faster to move to XGBoost.jl. I think you can close this, and I'll try again later when I have time.
