Mlj table #262

Merged · 11 commits merged into main from mlj-table on Oct 27, 2023
Conversation

@jeremiedb (Member) commented Oct 17, 2023

Fix #260
@ablaom I moved the MLJ wrapper from the original Matrix API to the new Tables one, which provides native support for richer inputs, including ordered and unordered (Multiclass) CategoricalValues.
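For illustration, something along these lines should now work directly (a sketch; column names and data are made up):

using MLJ, CategoricalArrays

EvoTreeBooster = @load EvoTreeRegressor pkg=EvoTrees
booster = EvoTreeBooster()

# mixed-scitype table: Continuous, OrderedFactor and Multiclass columns
X = (
    x_cont = rand(100),
    x_ord  = categorical(rand(["low", "mid", "high"], 100), ordered=true),
    x_cat  = categorical(rand(["red", "green", "blue"], 100)),
)
y = rand(100)

mach = machine(booster, X, y) |> fit!
predict(mach, X)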

@codecov (bot) commented Oct 17, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (535b0c4) 51.05% compared to head (a4c9dcb) 51.90%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   51.05%   51.90%   +0.84%     
==========================================
  Files          22       22              
  Lines        1986     1944      -42     
==========================================
- Hits         1014     1009       -5     
+ Misses        972      935      -37     
Files                               Coverage Δ
src/fit-utils.jl                    90.36%  <100.00%>  (-1.36%) ⬇️
src/fit.jl                          96.39%  <100.00%>  (+0.03%) ⬆️
src/predict.jl                      92.24%  <100.00%>  (+0.06%) ⬆️
src/MLJ.jl                          88.88%  <92.30%>   (+2.88%) ⬆️
ext/EvoTreesCUDAExt/predict.jl       0.00%  <0.00%>    (ø)
ext/EvoTreesCUDAExt/fit-utils.jl     0.00%  <0.00%>    (ø)


@ablaom (Contributor) commented Oct 17, 2023

Happy to review this shortly.

MMI.selectrows(::EvoTypes, I, A, y) =
((matrix=view(A.matrix, I, :), names=A.names), view(y, I))
MMI.selectrows(::EvoTypes, I, A) = ((matrix=view(A.matrix, I, :), names=A.names),)

@ablaom (Contributor) commented Oct 19, 2023

My understanding is this: previously table -> matrix conversion happened in reformat and not in EvoTree's internal training, which means MLJ's IteratedModel wrapper avoided repeating these conversions over and over in controlled iteration.

The change here suggests that table -> matrix conversion is now "internalised", which means the performance gain is lost, right? What am I missing here?
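For concreteness, the pattern I have in mind is roughly this (a sketch, not the actual EvoTrees code):

import MLJModelInterface as MMI
using Tables

# expensive table -> matrix conversion done once, when the machine is constructed:
MMI.reformat(::EvoTypes, X, y) =
    ((matrix=MMI.matrix(X), names=Tables.schema(X).names), y)

# after which row-resampling during controlled iteration only takes cheap views,
# via the selectrows methods quoted above.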

@jeremiedb (Member, Author) replied:

I'm afraid this brings back to the forefront the issue regarding MLJ's handling of iterated processes for models that rely on a caching/initialization mechanism.

Once an EvoTree model is initialized with a training dataset, that dataset should be considered fixed: no subsampling should be performed on it throughout the iterations. The algorithm itself is responsible for the subsampling operations across iterations, using tricks that are algo-specific. This is true, AFAIK, of all "high performance" GBT implementations (incl. XGBoost, LightGBM, CatBoost).

Looking at the error that results when trying an IteratedModel under this PR: why does MLJ's fit_only! need to perform a data resampling, i.e. fit(model, verbosity, _resampled_data(mach, model, rows)...)?

The logic used in EvoTrees so far to handle MMI's update method has been to completely ignore the data args A and y:

grow_evotree!(fitresult, cache, model)
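In full, the pattern is roughly this (a sketch, not the exact method):

function MMI.update(model::EvoTypes, verbosity, fitresult, cache, A, y, w=nothing)
    # A, y (and w) are deliberately ignored: the cache built at fit time
    # already holds the training data
    grow_evotree!(fitresult, cache, model)
    return fitresult, cache, (;)
end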

I'm still unclear why an issue occurred with IteratedModel in the first place, since with this update behavior only the model and cache are actually used when iterating. And EvoTrees' MLJ tests pass, so the basic operations seem to work fine, including iterative calls to fit!.

This is still very much WIP, but I'm trying to lay down what I have experienced as shortcomings of ML toolkits, incl. scikit-learn, which assume that a model can be instantiated as a MyModelRegressor, rather than having distinct management of the model configuration, its initialization, and the fitting of that initialized model: https://github.com/Evovest/EvoTrees.jl/blob/learnAPI/experiments/learnAPI.jl. This is the kind of approach we need to rely on internally to handle our ML pipelines, but I'd definitely like to see such support in an open framework.
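Roughly, the separation I have in mind looks like this (hypothetical function names, just to sketch the idea; see the linked experiment for the actual exploration):

config  = EvoTreeRegressor(eta=0.05, max_depth=6)  # pure hyper-parameter container
learner = init(config, x_train, y_train)           # one-time data binning / cache setup
for _ in 1:100
    fit!(learner)                                  # grow trees; training data stays fixed
end
p = predict(learner, x_eval)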

@jeremiedb (Member, Author) replied:

@ablaom The issue I encountered testing on IteratedModel appears to have been a red herring: the failure was caused by the log_loss measure not accepting the Deterministic output of the EvoTreeRegressor (isn't that unnecessarily restrictive? I actually don't see a logistic regression as less deterministic than "mse", but that's another topic!).
It seems that MLJ's fit! call to reformat effectively results in a no-op, as expected (each iteration following the first directly mutates the instantiated fitted_model and the cache). Training speed is aligned with the internal fit_evotree when using a very large evaluation Step (but quickly blows up, going from a 12 sec reference time to ~68 sec with Step(5)).

So, back to the specific topic of moving to EvoTrees' Tables API: AFAICT the MLJ integration works fine. Or at least as well as the existing one, but with the added benefit of richer input feature types.

@ablaom (Contributor) commented Oct 24, 2023

Okay, I think I was mistaken in thinking that dumping the model-specific data pre-processing (reformat) and model-specific resampling (selectrows) would slow down IteratedModel. As you point out, we only call update after the first train, so the table -> matrix conversion happens only once. I have also confirmed this with a benchmark, in which this PR actually runs slightly faster for IteratedModel, for Step sizes of 1, 5, 30 and a total of 300 iterations (details at the end).

On the other hand, I do expect some impact on TunedModel wrapping, since here changing the parameter being tuned means we call fit, and not update, for each new parameter value. A benchmark suggests this is indeed the case, but the slowdown is small (assuming it's even statistically significant):

# on EvoTrees master branch:

@btime fit!($mach, verbosity=0)
#  15.437 μs (100 allocations: 4.55 KiB)

# on this PR:

@btime fit!($mach, verbosity=0)
#  17.439 μs (100 allocations: 4.55 KiB)

Benchmark details are attached below.

So, I won't push back against your reluctance to update the reformat/selectrows implementation; I assume you have good reasons for not doing so.

I'm not sure I fully understand all the objections to the current API, but I suggest we have this discussion in a different forum. (Still happy to take a call to discuss this.)

API changes to MLJModelInterface, if they are breaking, will not happen in any hurry.

I will proceed with the rest of my review shortly.

using MLJModels
using MLJIteration
using MLJBase
using BenchmarkTools
using StatisticalMeasures
using MLJTuning

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)


# # CONTROLLED ITERATION

# number of iterations:
N = 300

steps = [1, 5, 30]

iterated_boosters = [
    IteratedModel(
        model=booster,
        resampling=Holdout(fraction_train=0.8),
        controls=[Step(s), NumberLimit(round(Int, N/s))],
        measure=l1,
    )
    for s in steps
]

for ib in iterated_boosters
    mach = machine(ib, X, y)
    print("step=", ib.controls[1].n)
    @btime fit!($mach, verbosity=0)
    @assert report(mach).n_iterations == N
end

# on EvoTrees master branch:

# step=1  14.351 μs (101 allocations: 4.61 KiB)
# step=5  14.301 μs (101 allocations: 4.61 KiB)
# step=30  14.342 μs (101 allocations: 4.61 KiB)

# on this PR:

# step=1  13.701 μs (101 allocations: 4.61 KiB)
# step=5  13.726 μs (101 allocations: 4.61 KiB)
# step=30  13.776 μs (101 allocations: 4.61 KiB)


# # TUNING

r = range(booster, :lambda, lower = 0, upper = 0.5)
tuned_booster = TunedModel(
    booster,
    tuning=RandomSearch(),
    range=r,
    measure = l2,
    n = 500,
)

mach = machine(tuned_booster, X, y)

# on EvoTrees master branch:

@btime fit!($mach, verbosity=0)
#  15.437 μs (100 allocations: 4.55 KiB)

# on this PR:

@btime fit!($mach, verbosity=0)
#  17.439 μs (100 allocations: 4.55 KiB)

@ablaom (Contributor) commented Oct 24, 2023

@jeremiedb There seems to be a problem handling tables that are row-based. Perhaps you are calling Tables.columnnames(X) directly on a table X (a common gotcha)? You can only call this on tables implementing the AbstractColumns interface, or on rows of a table implementing the AbstractRow interface.
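For reference, the pattern that works for any Tables.jl-compatible source is to first materialize a columns object (a minimal sketch):

using Tables

cols   = Tables.columns(X)          # AbstractColumns object, valid for any table X
fnames = Tables.columnnames(cols)   # safe here, unlike on X itself
col1   = Tables.getcolumn(cols, 1)

Here's a reproduction of the failure: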

Pkg.activate("explore002", shared=true)

using MLJBase
using MLJModels
using Tables

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)

# this works:
mach = machine(booster, X, y) |> fit!

# this doesn't
X = Tables.rowtable(X)
mach = machine(booster, X, y) |> fit!

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1] 
# Stacktrace:
#  [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)                                                                                      
#    @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
#  [2] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64}, w::Nothing)
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [3] fit(model::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, verbosity::Int64, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, y::Vector{Float64})            
#    @ EvoTrees ~/.julia/packages/EvoTrees/jzXEo/src/MLJ.jl:3
#  [4] fit_only!(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}; rows::Nothing, 
# verbosity::Int64, force::Bool, composite::Nothing)                                         
#    @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:680
#  [5] fit_only!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:606 [inlined]
#  [6] #fit!#63
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:777 [inlined]
#  [7] fit!
#    @ ~/.julia/packages/MLJBase/fEiP2/src/machines.jl:774 [inlined]
#  [8] |>(x::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, f::typeof(fit!))
#    @ Base ./operators.jl:907
#  [9] top-level scope
#    @ REPL[12]:1

@jeremiedb (Member, Author) commented Oct 24, 2023
jeremiedb commented Oct 24, 2023

Perhaps you are calling Tables.columnnames(X) directly on a table X

Guilty :) The Tables implementation effectively assumed it was working with column tables. Since there were several internal dependencies on getcolumn, I've added a conversion to Tables.columntable. It doesn't add any overhead when tables are already in that format, which I suppose is the most common case (e.g. DataFrame).
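For instance (a sketch; DataFrames used only for illustration):

using Tables, DataFrames

df  = DataFrame(x1=rand(10), x2=rand(10))
ct  = Tables.columntable(df)  # essentially free: wraps the existing columns in a NamedTuple
rt  = Tables.rowtable(df)     # row-oriented: Vector of NamedTuples
ct2 = Tables.columntable(rt)  # here an actual column-wise conversion is performed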

Regarding the impact on TunedModel, I'd expect it to be very small: when a new hyper-parameter configuration is tested, a full model initialization must be performed, and that initialization time is essentially the same for both the original Matrix and the new Tables-based internal APIs. The bulk of the computing time during an EvoTrees training is normally spent iterating over the n trees, steps which are identical regardless of whether the input was originally a Matrix or a table.

Thanks a lot for taking the time to look at the PR. And sorry about the digression on the iterated API. I'm trying to put together a coherent proposal so as to have a constructive take; I'll ping back once done!

@ablaom (Contributor) commented Oct 25, 2023

I think that row-table problem persists with predict:

EvoTreeBooster = @load EvoTreeRegressor
booster = EvoTreeBooster()

X, y = make_regression(1000, 5)
X = Tables.rowtable(X);

# smoke tests:
mach = machine(booster, X, y) |> fit!    # this now works, thanks
predict(mach, X) # fails

# ERROR: BoundsError: attempt to access 1000-element Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}} at index [1]    
Stacktrace:
 [1] getcolumn(x::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, i::Int64)                                                                                      
   @ Tables ~/.julia/packages/Tables/NSGZI/src/Tables.jl:101
 [2] binarize(df::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}; fnames::NTuple{5, Symbol}, edges::Vector{Any})                                                    
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/fit-utils.jl:74
 [3] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU}; ntree_limit::Int64)
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:91
 [4] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}}, ::Type{EvoTrees.CPU})          
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [5] predict(m::EvoTrees.EvoTree{EvoTrees.MSE, 1}, data::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})                                
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/predict.jl:83
 [6] predict(#unused#::EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, fitresult::EvoTrees.EvoTree{EvoTrees.MSE, 1}, A::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})    
   @ EvoTrees ~/.julia/packages/EvoTrees/3Ocxy/src/MLJ.jl:45
 [7] predict(mach::Machine{EvoTrees.EvoTreeRegressor{EvoTrees.MSE}, true}, Xraw::Vector{NamedTuple{(:x1, :x2, :x3, :x4, :x5), NTuple{5, Float64}}})                                
   @ MLJBase ~/.julia/packages/MLJBase/fEiP2/src/operations.jl:133
 [8] top-level scope
   @ REPL[42]:1

@jeremiedb (Member, Author) commented Oct 26, 2023

I think that row-table problem persists with predict:

Thanks for the catch. I had neglected to add a test for row tables; one has now been added to the MLJ test suite (covering both fit! and predict). Tests are now passing following the patch to convert to Tables.columntable in EvoTrees.predict.

Regarding @spawn vs @threads, I've followed the discussions around it. The post you referred to prompted the migration to :static that I did in v0.15, which turned out to be ill advised, as it resulted in issues for a user running EvoTrees within an already threaded routine. My understanding is that problems arise when one relies on threadid() as a reference to the thread doing the work. To the best of my knowledge, that is not a pattern used in this codebase. I simply wanted to move all threaded operations to a common approach, as the histogram-building routines already use @threads and provide some performance improvement over @spawn.
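To illustrate the distinction (a generic sketch, not EvoTrees code):

using Base.Threads

# Risky pattern: under the default :dynamic scheduler a task may yield and
# resume on another thread, so threadid() is not a stable slot index:
acc = zeros(nthreads())
@threads for i in 1:1_000
    acc[threadid()] += i  # relies on threadid() staying fixed -- it may not
end

# Safe pattern: index the accumulator by the work chunk, not the thread:
chunks = collect(Iterators.partition(1:1_000, 100))
partials = zeros(length(chunks))
@threads for k in eachindex(chunks)
    partials[k] = sum(chunks[k])
end
total = sum(partials)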

@ablaom (Contributor) left a comment

I have finished reviewing those parts of the PR that seemed relevant to the MLJ interface. It looks good to go.

@jeremiedb jeremiedb merged commit dbee39d into main Oct 27, 2023
8 checks passed
@jeremiedb jeremiedb deleted the mlj-table branch October 27, 2023 04:37
Successfully merging this pull request may close these issues: Categorical Variables Not Working (#260)