Question: Boosting in multiple iterations #403

Closed
mtl-tony opened this issue Jan 28, 2023 · 8 comments

@mtl-tony
Contributor

I'm currently attempting to fit an EBM in multiple boosting iterations for a specific problem I have. It's similar to how all main features are fit first and the interaction terms are then fit on the residuals, as shown below in your code in ebm.py.

Fitting main Features:

for idx in range(self.outer_bags):
    parallel_args.append(
        (
            dataset,
            bags[idx],
            None,
            term_features,
            inner_bags,
            boost_flags,
            self.learning_rate,
            self.min_samples_leaf,
            self.max_leaves,
            early_stopping_rounds,
            early_stopping_tolerance,
            self.max_rounds,
            noise_scale,
            bin_data_weights,
            rngs[idx],
            None,
        )
    )
results = provider.parallel(EBMUtils.cyclic_gradient_boost, parallel_args)

Fitting Interactions:
parallel_args = []
for idx in range(self.outer_bags):
    parallel_args.append(
        (
            dataset,
            bags[idx],
            scores_bags[idx],
            boost_groups,
            inner_bags,
            boost_flags,
            self.learning_rate,
            self.min_samples_leaf,
            self.max_leaves,
            early_stopping_rounds,
            early_stopping_tolerance,
            self.max_rounds,
            noise_scale,
            bin_data_weights,
            rngs[idx],
            None,
        )
    )
results = provider.parallel(EBMUtils.cyclic_gradient_boost, parallel_args)

For my problem I need to fit my main features in two iterations. I plan to first fit a subset of my features X in a first boosting iteration, which we can call model 1. The next step is to fit a different subset of features on the residuals from model 1 (leveraging the init_score available in the cyclic_gradient_boost function). I'm doing this by simply modifying the ebm.py code to allow this adjustment.
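A minimal sketch of the two-stage idea, assuming squared-error regression and equal-width bins. It is not the interpret implementation: fit_stage and its parameters are hypothetical names, and the only point is that the second stage starts from the first stage's predictions (the role init_score plays) and boosts a different feature subset on what is left.

```python
import random


def fit_stage(X, y, features, init_score, rounds=200, lr=0.1, n_bins=8):
    """Cyclically boost one piecewise-constant shape function per feature.

    init_score carries the per-sample predictions of an earlier stage, so
    this stage only fits what those predictions left unexplained.
    """
    bins = {}
    for f in features:
        lo = min(row[f] for row in X)
        hi = max(row[f] for row in X)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
        bins[f] = [min(n_bins - 1, int((row[f] - lo) / width)) for row in X]

    shapes = {f: [0.0] * n_bins for f in features}
    pred = list(init_score)
    for _ in range(rounds):
        for f in features:  # round-robin over this stage's features only
            sums = [0.0] * n_bins
            counts = [0] * n_bins
            for b, target, p in zip(bins[f], y, pred):
                sums[b] += target - p  # residual under squared error
                counts[b] += 1
            step = [lr * s / c if c else 0.0 for s, c in zip(sums, counts)]
            shapes[f] = [old + d for old, d in zip(shapes[f], step)]
            pred = [p + step[b] for p, b in zip(pred, bins[f])]
    return shapes, pred


def mse(y, pred):
    return sum((t - p) ** 2 for t, p in zip(y, pred)) / len(y)


# Synthetic data, purely for illustration.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(300)]
y = [2.0 * r[0] + r[1] ** 2 - r[2] for r in X]

zero = [0.0] * len(X)
shapes1, pred1 = fit_stage(X, y, features=[0], init_score=zero)      # "model 1"
shapes2, pred2 = fit_stage(X, y, features=[1, 2], init_score=pred1)  # boosted on top
print(mse(y, zero), mse(y, pred1), mse(y, pred2))
```

The final prediction is the sum of both stages' shape functions, which is why the result stays additive.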

I have two questions about this.

  1. Do you see any theoretical issues with fitting features this way (first fitting one subset of the features in the cyclic boosting procedure, then fitting the other features in a second cyclic boosting iteration)?
  2. Are there plans to eventually break the fit function down into sub-fitting functions? For example, you could have _fit_main() and _fit_interactions(). For standard problems users could still call .fit(), which would work the same way it does now and would use _fit_main() and _fit_interactions() internally. For more specific problems you could call them yourself in a specific order, like the example below.

Step 1) ._fit_main() for a first subset of features
Step 2) ._fit_main() for the rest of the features
Step 3) ._fit_interactions() for your pairwise interactions
Step 4) ._fit_interactions() for your 3rd order + 4th order interactions that are available now as of 0.3

I appreciate any feedback you have :)

@paulbkoch
Collaborator

Hi @mtl-tony -- This is a very good discussion to have, and to be honest it's something that I have been thinking about for a while without coming to a firm viewpoint yet. Everything you are proposing makes sense, and the main questions IMHO are about how to present this as a coherent interface.

I'll start by describing a few building blocks that I've mostly concluded we should have. First off, I think your earlier PR (#371) is a critical piece in this puzzle. With an offset (init_score in the LightGBM terminology) we can boost an EBM on top of any existing model, including ones that are opaque to us. Boosting on top of opaque models might be a bit of a niche scenario, but I'm sure someone will want to do it at least for research purposes. Boosting on top of other glassbox models like linear models or splines makes a lot more sense to me. Of course, this parameter could also be used to boost an EBM on top of another EBM.

If we agree that we need an init_score parameter that accepts raw scores, then we might as well give it additional functionality and allow it to also accept a previously built model, or a predict function. Our measure_interactions function does this already for similar purposes:

and then:

probs = clean_dimensions(init_score.predict_proba(X_unclean), "init_score")
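As a rough illustration of an init_score that accepts raw scores, a predict function, or a fitted model, here is a sketch for the binary case only. resolve_init_score and ConstantModel are hypothetical names made up for this example; this is not how measure_interactions is written.

```python
import math


def resolve_init_score(init_score, X):
    """Return one raw (logit-scale) score per row of X.

    Accepts raw scores, a callable, or an object exposing predict_proba.
    """
    if hasattr(init_score, "predict_proba"):
        eps = 1e-12
        scores = []
        for row in init_score.predict_proba(X):
            p = min(max(row[1], eps), 1.0 - eps)  # positive-class probability
            scores.append(math.log(p / (1.0 - p)))
        return scores
    if callable(init_score):
        return [float(s) for s in init_score(X)]
    scores = [float(s) for s in init_score]
    if len(scores) != len(X):
        raise ValueError("init_score must have one entry per sample")
    return scores


class ConstantModel:
    """Stand-in for a previously fitted binary classifier."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        return [[1.0 - self.p, self.p] for _ in X]


X = [[0.0], [1.0]]
from_scores = resolve_init_score([0.5, -1.0], X)
from_model = resolve_init_score(ConstantModel(0.5), X)
from_function = resolve_init_score(lambda rows: [1 for _ in rows], X)
```

Whatever the caller passes in, the boosting code downstream only ever sees a plain list of raw scores.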

Another critical question is how we specify which features or interactions get boosted on within each phase. I have a few ideas on this which I'll post in the other thread that you're on ( #225 )

The last question, which I believe to be the hardest, is how we end up combining the models. A couple of ideas:

  • we could enable this via a post-processing utility function. We already have a merge_ebms function that perhaps could do this already (https://github.com/interpretml/interpret/blob/develop/examples/python/notebooks/Merging%20EBM%20Models.ipynb). I'd need to really think through this scenario before confidently saying whether the existing merge_ebms function fulfils all requirements, or whether we need to modify it.
  • if a model was passed to us via the init_score/offset parameter described above, we could automatically merge the models at the end, however this strikes me as a very bad interface since it doesn't match the functionality that you get if you pass in raw scores. Maybe there's a similar approach however to make it clearer to the caller.
  • we might be able to use the scikit-learn partial_fit API. To be honest, I've never used it myself, so it's on my list of things to investigate. In our scenario though we want to modify what features we're boosting on, and perhaps other parameters as well, so I'm not sure if partial_fit is a great "fit" for us.
  • some of my co-workers think we should define something akin to a scikit-learn pipeline that allows the caller to specify in advance how the multiple stages should be executed. We've worked through an example of how this might look, but there are still many questions that would need to be worked out.

My lean at this point would be to enable this scenario with a combination of init_score and merge_ebms. The plus here is that these functions already exist for other reasons, so re-using them adds no additional complexity to our API. The downside is that this isn't the most integrated approach in the way that a pipeline-like solution would be. We'd need to write some clear examples and documentation to highlight these capabilities at minimum.

There's another related aspect worth mentioning here: We try to intelligently guess the feature types, and also where to cut continuous features into separate bins. merge_ebms handles the scenario where the features were binned differently, but it's less than ideal when the bin cuts do not match up since the bins in the final model need to be a super-set of the models being merged. The number of cut points therefore increases with each merge operation when they do not match up. Ideally, if multiple EBMs are meant to be merged, they would all be constructed with the same feature preprocessor definitions. This will happen by default if you pass in the exact same data each time, so in the above scenario it shouldn't be a problem. Having some way of harmonizing the preprocessor definitions is something we'll need though to support other complex merging scenarios. You can already mostly do this through the feature_types parameter, but it's currently clunky and more complex than I'd like. Perhaps allowing something like "feature_types=other_ebm" might work. It's another area that needs more investigation.
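To make the bin superset point concrete, here is a small sketch for a single continuous feature. merge_cuts and remap_scores are hypothetical helpers, not the merge_ebms code, and the per-bin scores are simply added, as they would be when one stage is boosted on top of another.

```python
import bisect


def merge_cuts(cuts_a, cuts_b):
    """Union of two sorted cut lists: the merged bins refine both inputs."""
    return sorted(set(cuts_a) | set(cuts_b))


def remap_scores(old_cuts, old_scores, new_cuts):
    """Re-express per-bin scores on a finer binning.

    A feature with k cuts has k + 1 bins; bin i covers [cuts[i-1], cuts[i]).
    new_cuts must be a superset of old_cuts, so each new bin lies inside
    exactly one old bin.
    """
    new_scores = []
    for i in range(len(new_cuts) + 1):
        # The lower edge of the new bin identifies its parent bin.
        lower = new_cuts[i - 1] if i > 0 else float("-inf")
        new_scores.append(old_scores[bisect.bisect_right(old_cuts, lower)])
    return new_scores


a_cuts, a_scores = [1.0, 2.0], [1.0, 2.0, 3.0]
b_cuts, b_scores = [1.5, 2.0], [-0.5, 0.0, 0.5]

merged = merge_cuts(a_cuts, b_cuts)            # one more cut than either input
a_on = remap_scores(a_cuts, a_scores, merged)
b_on = remap_scores(b_cuts, b_scores, merged)
combined = [s + t for s, t in zip(a_on, b_on)]
print(merged, combined)
```

When both models share the same cuts, merge_cuts returns them unchanged and no extra bins appear, which is the case described above for identical input data.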

I'm really interested in getting feedback on any of the above ideas, or suggestions for alternate approaches.

@mtl-tony mtl-tony reopened this Jan 31, 2023
@mtl-tony
Contributor Author

mtl-tony commented Jan 31, 2023

Hello Paul,

Sorry about closing and reopening; I misclicked. Thanks so much for taking the time to give such a detailed response. It's really cool to see that your team seems to be working on some of these ideas already. I'm curious what the best way to open up the boosting to users would be, as I agree it's tough to make it fit in a nice way where users can play around with it. I need to think about it more, but I feel a pipeline method would work; I'm just unsure whether it's the cleanest way, as you mentioned. I'll try some things on my side and share a prototype if I have one, since, as you said, most of the functions are already coded and it's mostly about figuring out the structure.

Regarding the preprocessing portion, I believe the method you described of taking a different EBM as an input to reuse the same buckets ("feature_types=other_ebm") would be a great feature. People who need to update models may prefer that the buckets not change every time, so they can better compare updated models on newer data.

Two follow-up questions on this discussion:

  1. Is the merge_ebms function used for the outer bags?
  2. Are there any current plans to extend FAST to 3rd order interactions? I know they're harder to interpret, but in specific domains 3rd order interactions would still be interesting to have. I've noticed them occur in my domain's data at times, so a way to detect them would be useful. Judging by the algorithm, the difficulty seems to be memory limitations, as the number of combinations increases exponentially with the order.

@paulbkoch
Collaborator

Hi @mtl-tony --

Yes, I agree that "feature_types=other_ebm" would be a nice feature to have in the package. It's also important in the context of federated learning, which was the original impetus for merge_ebms. In the federated learning scenario we would also prefer to have synchronized bins across the federated models. I think one possible solution there is to create an initial "shell" EBM by specifying zero boosting rounds on some representative dataset. This shell EBM would act sort of like a scikit-learn preprocessor, but with support for interactions, which isn't natural in the scikit-learn model. You could then distribute the shell EBM to the federated locations, and they would each train EBMs with a common binning definition via feature_types=shell_ebm. The shell EBM could also be editable with the suite of model editing tools that we have yet to write. Another option would be to design a new class for preprocessing that is interaction aware, but even if we do that, I think supporting feature_types=other_ebm would still be a nice to have.
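The shell idea can be pictured with plain quantile binning: cuts are computed once on a representative sample and every site bins its own data with those same cuts. This is only a sketch under that assumption; quantile_cuts and to_bins are hypothetical helpers, and the real preprocessor also handles categoricals, missing values and interactions.

```python
import bisect


def quantile_cuts(values, max_bins):
    """Cut points that split the sample into roughly equal-count bins."""
    ordered = sorted(values)
    cuts = []
    for k in range(1, max_bins):
        cut = ordered[k * len(ordered) // max_bins]
        if not cuts or cut > cuts[-1]:  # drop duplicate cuts
            cuts.append(cut)
    return cuts


def to_bins(values, cuts):
    """Bin index per value; bin i covers [cuts[i-1], cuts[i])."""
    return [bisect.bisect_right(cuts, v) for v in values]


# The "shell": binning defined once on representative data.
shell_cuts = quantile_cuts(range(100), max_bins=4)

# Each federated site reuses the shell's cuts instead of computing its own.
site_a_bins = to_bins([3, 30, 99], shell_cuts)
site_b_bins = to_bins([25, 74, 75], shell_cuts)
print(shell_cuts, site_a_bins, site_b_bins)
```

Because both sites index into the same bins, their per-bin scores line up directly when the models are later merged.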

It would be great to brainstorm more on the pipeline solution as it's the least fleshed-out idea, and having a prototype would be the best way to explore it.

We don't call merge_ebms for the outer bags, but we do share some internal processing. See this function, which is called by both:

def _process_terms(n_classes, bagged_scores, bin_weights, bag_weights):

For 3rd order interactions, the memory issues are solvable since we really just need to keep a list of the top N interactions, and we can do that efficiently with a heap. A naive implementation would still need to examine all possible triples which grows at a cubic rate, so even after fixing the memory issues there are still CPU considerations. Probably with some simple heuristics though we can prune the search space aggressively. For example, we could only allow triples from the set of features which formed pairs. There are lots of variations on how you might do this though.
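A sketch of the two ideas in this paragraph: restrict the candidate triples to features that already appeared in a selected pair, and keep only the top N with a bounded heap so memory stays proportional to N. The strength function is a placeholder; in practice it would be an interaction score such as the one FAST computes, and all names here are hypothetical.

```python
import heapq
from itertools import combinations


def candidate_triples(pairs):
    """Triples drawn only from features that appeared in a selected pair."""
    features = sorted({f for pair in pairs for f in pair})
    return combinations(features, 3)


def top_n(candidates, strength, n):
    """Keep the n strongest candidates using a size-n min-heap."""
    heap = []
    for cand in candidates:
        item = (strength(cand), cand)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the current weakest
    return [cand for _, cand in sorted(heap, reverse=True)]


def toy_strength(triple):
    # Placeholder score; a real implementation would measure the interaction.
    return sum(triple)


pairs = [(0, 1), (1, 2), (2, 5)]
candidates = list(candidate_triples(pairs))
best = top_n(candidates, toy_strength, n=2)
print(candidates, best)
```

The heap removes the memory problem, but every candidate is still scored once, so the pruning step is what limits CPU cost.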

Lastly, the public interface needs a bit of thought. Today we have one parameter for main binning (max_bins=256) and another for pairs (max_interaction_bins=32). If we wanted triples, we'd probably want to change this to something like max_bins=256, max_pair_bins=32, max_higher_bins=8, or something similar. We'd also need a way to specify how many interactions to generate. Currently we have the interactions=10 parameter which only makes pairs, so we'd have to break that into pairs=10 and higher_interactions=5, or something like that. We also currently overload the interactions parameter to specify specific pairs/triples/etc with interactions=[(0,1), (1,2,11), ...], and that all gets a bit messier if we have multiple parameters for interactions. What if someone specifies pairs=[(1,2,3)] for example? Sure, we can throw an error if a triple is passed as a pair, but it just makes everything a bit more confusing to new users who probably just want to use defaults to start with anyways.
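For illustration, the validation this would require might look like the sketch below. The pairs parameter does not exist today, and split_by_order and check_pairs are hypothetical helper names.

```python
def split_by_order(interactions):
    """Separate explicit terms into pairs and higher-order terms."""
    pairs, higher = [], []
    for term in interactions:
        term = tuple(term)
        if len(term) < 2 or len(set(term)) != len(term):
            raise ValueError(f"not a valid interaction term: {term!r}")
        (pairs if len(term) == 2 else higher).append(term)
    return pairs, higher


def check_pairs(pairs):
    """Reject anything that is not a two-feature term."""
    _, higher = split_by_order(pairs)
    if higher:
        raise ValueError(f"pairs only accepts two-feature terms, got {higher[0]!r}")
    return [tuple(p) for p in pairs]


# The overloaded form from the text: one list holding mixed orders.
found_pairs, found_higher = split_by_order([(0, 1), (1, 2, 11)])
print(found_pairs, found_higher)
```

With a single overloaded interactions list, split_by_order is all that is needed; the extra error path only appears once the parameter is divided by order.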

Given the unknowns, and the less than elegant interface changes (unless someone figures out something better), my default thinking is that maybe we should enable triples/quads/etc via the more complex and flexible interface that we're discussing here and see how people use it. If everyone raves about it, then consider making triples a more default experience.

One more aspect to mention: Today the measure_interactions function does not work with 3rd order interactions, but upgrading it to do so is relatively easy (in C++ though). That'll be important toward letting people experiment with GA3Ms.

@mtl-tony
Contributor Author

mtl-tony commented Feb 1, 2023

Hello,

I've been thinking about this more today and I think the pipeline method would be nice. Its advantage is that you can give more freedom to your users while still keeping your core classes, ExplainableBoostingClassifier and ExplainableBoostingRegressor, for users who just want a standard GA2M; these classes would simply build a predefined pipeline for you. I imagine the format would be somewhere along these lines:
model = interpret.boosting_pipeline(
    interpret.boosting(features=[1, 2, 3, 4, 5], interaction_level=1, max_bins=256),
    interpret.fast_algo(),
    interpret.boosting(interaction_level=2, max_bins=32),
    interpret.boosting(features=[(1, 2, 3)], interaction_level=3, max_bins=8),
)

And you would have the default pipelines that you currently have by simply calling ExplainableBoostingRegressor or ExplainableBoostingClassifier . Do you see any issues with framing the problem in this way? I feel torch nn and sklearn pipeline frameworks could be borrowed to make this work, but I wanted your opinion if I'm forgetting anything important in the building of an EBM model. I assume each boosting class or whatever it could be named would also include the preprocessing/binning portion.

Also, regarding the C++: I'm no expert, but could you point me to where the FAST implementation is and where it would need to change for 3rd order interactions? Even if it's not included in the model, I still think a 3rd order interaction detection algorithm would be useful for general data analysis.

Thanks again,
Tony

@paulbkoch
Collaborator

Hi @mtl-tony -- Makes sense to me, and I think that format is something that people in the community are comfortable with and probably expect.

I think the equivalent in a procedural format would be something like:

from interpret.glassbox import ExplainableBoostingClassifier
from interpret.utils import measure_interactions

ebm_mains = ExplainableBoostingClassifier(interactions=0)
ebm_mains.fit(X, y)

pairs = [x[0] for x in measure_interactions(X, y, init_score=ebm_mains)[0:10]]
ebm_pairs = ExplainableBoostingClassifier(interactions=pairs)
ebm_pairs.fit(X, y, merge=ebm_mains)

triples = [x[0] for x in measure_interactions(X, y, init_score=ebm_pairs, interaction_level=3)[0:10]]
ebm_triples = ExplainableBoostingClassifier(interactions=triples)
ebm_triples.fit(X, y, merge=ebm_pairs)

With the rule that if the parameter "merge" is used in the fit function then the mains are excluded, and things like init_score and the bin definitions are taken from the given model.

The obvious drawback of the procedural format is it's more verbose and requires repetitively passing in X, y, and the model from the previous stage. On the positive side, it's probably easier to introspect interior state like the pairs variable in a notebook environment, and also probably easier to insert more customized steps between the stages (think something weird like accessing a database or web API).

Perhaps the right approach is to have a pipeline like system as syntactic sugar over something like the above?
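One way to picture a pipeline as sugar over the procedural steps is the sketch below. Stage, run_pipeline and the two toy stage functions are hypothetical; the "models" are plain dicts that only record which terms they hold, so no real fitting happens here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    fit: Callable  # fit(X, y, previous_model) -> model


def run_pipeline(stages, X, y):
    """Run stages in order, handing each one the model from the stage before."""
    previous = None
    history = []  # kept so intermediate results can still be inspected
    for stage in stages:
        previous = stage.fit(X, y, previous)
        history.append((stage.name, previous))
    return previous, history


def fit_mains(X, y, previous):
    # Placeholder for fitting one main-effect term per feature.
    return {"terms": [(i,) for i in range(len(X[0]))]}


def fit_pairs(X, y, previous):
    # Placeholder for boosting a pair on top of the previous stage.
    return {"terms": previous["terms"] + [(0, 1)]}


X = [[0.0, 1.0], [1.0, 0.0]]
y = [0, 1]
final, history = run_pipeline(
    [Stage("mains", fit_mains), Stage("pairs", fit_pairs)], X, y
)
print(final["terms"], [name for name, _ in history])
```

Keeping each stage a plain function preserves the procedural benefits mentioned above: a custom step is just another Stage, and history exposes the interior state.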

The C++ code that does interaction detection is here:

The code above needs to be changed to loop over an arbitrary number of dimensions up to k_cDimensionsMax. Handling tensors in C++ is a lot more work than in python. For an example, here's another function that loops over the tensor dimensions:

FastTotalState * pFastTotalState = &fastTotalState[0];

@mtl-tony
Contributor Author

mtl-tony commented Feb 3, 2023

Hey @paulbkoch

Thanks for sharing the procedural format! I think if we have something like that it would fit most needs while still allowing users to 'look into' the model easily. Once those functions are ready I think it would be good and yeah the pipeline could come later as it would be easy to have once the procedural portion is there.

The main focus would be adding the merge parameter you mentioned, so I'll start on that, since it's one of the keys to flexible fitting. Regarding the interaction detection algorithm, I took a look and I don't think my C++ skills are adequate yet, so I may come back to that problem once I have more experience.

Considering you've answered all my questions regarding this topic I'll close the issue and do a PR once I have something concrete regarding the fit + merge. Thanks so much for your insights.

Tony

@mtl-tony mtl-tony closed this as completed Feb 3, 2023
@paulbkoch
Collaborator

paulbkoch commented Feb 3, 2023

Hi @mtl-tony -- That would be great! My recommendation here would be to try breaking up that bigger feature into smaller PR chunks. I would try to independently develop and test the "feature_types=other_ebm" ability first since that is going to be a necessary component for any "merge" parameter. Once that and the init_score are both in, then "merge" just becomes an act of expressing both together, and also excluding the mains, which we already have.

@paulbkoch
Collaborator

Another possibility would be to use scikit-learn's warm_start methodology. It might look something like:

from interpret.glassbox import ExplainableBoostingClassifier
from interpret.utils import measure_interactions

ebm = ExplainableBoostingClassifier(interactions=0, warm_start=True)
ebm.fit(X, y)

ebm.mains = None  # or alternatively ebm.exclude = "mains"

pairs = [x[0] for x in measure_interactions(X, y, init_score=ebm)[0:10]]
ebm.interactions = pairs
ebm.fit(X, y)

triples = [x[0] for x in measure_interactions(X, y, init_score=ebm, interaction_level=3)[0:10]]
ebm.interactions = triples
ebm.learning_rate = 0.005 # we can also set other parameters
ebm.fit(X, y)
