Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smoothness over variable regions with outlier outcome values #548

Open
eadains opened this issue Jun 13, 2024 · 11 comments
Open

Smoothness over variable regions with outlier outcome values #548

eadains opened this issue Jun 13, 2024 · 11 comments

Comments

@eadains
Copy link

eadains commented Jun 13, 2024

I'm training an EBM regressor on insurance loss data where the outcome has many zero values and large right skew. This is my model configuration:

model = ExplainableBoostingRegressor(
    feature_names=X.columns,
    feature_types=[
        "nominal",
        "continuous",
        "continuous",
        "continuous",
    ],
    objective="tweedie_deviance:variance_power=1.7",
    interactions=0,
    outer_bags=100,
)

The curve for one of the continuous predictors looks like this:
default
where there is that dip between values 5 and 6. In this case, that region of the predictor space happens to contain mostly small outcome values. Here's another predictor:
default2
where there is a large jump up for a small region for small values of the parameter on the left side. Here, this region happens to contain a few large predictor values.

Essentially, there are some regions of the predictor space that contain some outliers in the outcome variable compared to other nearby regions. I'm wondering if there is a good way to "smooth out" these regions. I have tried increasing the validation size to 0.50, and that does seem to help:
smoothed_1
smoothed_2
but it also seems to dampen the responsiveness across the board. I'm not entirely sure, from an algorithmic perspective, how changing the validation size would affect this, so any thoughts there would also be helpful.

@paulbkoch
Copy link
Collaborator

Hi @eadains -- The magic parameter you seek is called "smoothing_rounds". We have an example notebook dedicated to smoothing rounds at: https://interpret.ml/docs/python/examples/interpretable-regression-synthetic.html

Changing the validation size seems to work nicely here. I haven't tried such a large percentage myself, and I'm a bit surprised it's that effective. Seeing it here makes me think that perhaps the current default of 0.15 is creating outer bags that are too homogeneous. If the full dataset happens to have outliers in some section of the graph, then the majority of the outer bagged models might agree on an outlier prediction in that section. Allowing the outer bags to diverge more might combat this effect.

Another parameter that can sometimes help with smoothness is the "greediness" parameter, or "greedy_ratio" in more recent versions. The original cyclic boosting algorithm that we initially released with can sometimes overfit individual features when you have a scenario where some features require a lot more boosting than others. Mostly this occurs when you have a dataset that mixes nominal and continuous features. To avoid this, the current algorithm has been changed to a hybrid between cyclic boosting and traditional greedy boosting. I've been calling it semi-greedy boosting. The "greediness" parameter has been around for a while, but it became the default in v0.6.0, so depending on what version you are running you might want to try turning it on manually.

Some of your graphs appear to be monotonic, at least in some parts. Applying monotonic constraints can sometimes improve smoothness. We support both training time monotonicity (monotone_constraints parameter) and post-processed monotonicity via the "monotonize" function.

As a last resort, you can always edit the model after fitting. GAMChanger (https://github.com/interpretml/gam-changer) is probably the best solution I can recommend there, but I think GAMChanger hasn't been modified yet to handle alternative loss functions, so someone needs to update it for that. Another alternative is to edit the model directly using the ebm.term_scores_ attribute.

If you're not using the latest version (0.6.1) of interpret, I would also recommend upgrading. The 0.6.0 release was mostly focused on improving the accuracy of EBMs, but there were some other changes that might lead to smoother models, especially for scenarios involving alternative loss functions.

Some other ideas: You could adjust your bin cuts so that all the outlier samples are placed into a single bin that includes additional samples from either side. Bin adjustment is done through the "feature_types" attribute which is documented at: https://interpret.ml/docs/python/api/ExplainableBoostingRegressor.html#explainableboostingregressor

@richcaruana might have more ideas.

@paulbkoch
Copy link
Collaborator

If any of these things work for you @eadains, please let us know. This is an area we'd like to get more feedback on, and could lead us to changing the default parameters if we can find something consistent to improve.

@eadains
Copy link
Author

eadains commented Jun 17, 2024

Paul,

Thank you for making the effort to look at this! I've been look at a different insurance dataset, available here if you want to fiddle with it yourself, and I have some more observations. The target variable is Pure Premium, which is total loss over exposure. It seems to me that the real problem here is with outliers generally.

I tried out the smoothing rounds parameter:

model = ExplainableBoostingRegressor(
    feature_names=X.columns,
    feature_types=[
        "nominal",
        "continuous",
        "continuous",
        "continuous",
        "continuous",
        "nominal",
        "nominal",
        "continuous",
        "nominal",
    ],
    objective="tweedie_deviance:variance_power=1.7",
    smoothing_rounds=4000,
    greedy_ratio=0
)

and while this does makes things, well, smoother, it also seems to "pin" the resulting curves around the outlier points. You can see it here, where I set the smoothing rounds to 200 (the default), 2000, and 4000:
image
Screenshot 2024-06-16 at 8 06 17 PM
Screenshot 2024-06-16 at 8 06 28 PM
where there is a single point with a very large outcome at that spike. Increasing the greedy ratio seems to make the problem worse not better.

Increasing outer bags and the validation size parameter still seems to do pretty well:

model = ExplainableBoostingRegressor(
    feature_names=X.columns,
    feature_types=[
        "nominal",
        "continuous",
        "continuous",
        "continuous",
        "continuous",
        "nominal",
        "nominal",
        "continuous",
        "nominal",
    ],
    objective="tweedie_deviance:variance_power=1.7",
    validation_size=0.50,
    outer_bags=100
)

Screenshot 2024-06-16 at 8 13 00 PM
Although it is still noticeable at some points. Increasing validation size to 0.75 smooths them out completely and gives very nice curves matching what I would expect, lthough this dataset is an order of magnitude larger than the one I originally posted about, and has even more extreme outliers in the outcome. It's also interesting that, run this way, the effect size of this variable is much smaller, but is increasing over this region, as opposed to the graphs above.

Given that increasing outer bags and smoothing rounds together seems to dramatically increase fitting time, I've found a nice balance by setting only:

validation_size = 0.50
smoothing_rounds = 2000

and leaving outer bags at its default. The same curve, again:
Screenshot 2024-06-16 at 8 54 47 PM

I hope my waffling may be useful. One thing I can say is that increasing the validation size does seem to improve smoothness regardless of what other hyperparameters are set. I think your explanation of the effect makes sense. It may be especially relevant in datasets with significant outliers, because increasing validation size means that fewer of the bagged datasets will contain those outliers, giving a more well-behaved average prediction. Smoothing rounds also seems to help with this.

If I have some extra time, I may play around with generating some synthetic data. I've found that insurance datasets are particularly strange because the outcome is mostly zeros with a few very large values. I've also been thinking that it may be possible to do some post-processing of the curves using a lasso-penalized regression, like trend filtering, to smooth the curves, but we'll see if I ever get around to implementing that!

@eadains
Copy link
Author

eadains commented Jun 17, 2024

Just another thought: I assume that right now we're taking the mean of the bagged estimators. Would it make any sense to implement an option to take the median instead?

@paulbkoch
Copy link
Collaborator

Hi @eadains -- This is all very fascinating. I really like your idea of offering a median bagged estimator option. If you're in the mood to experiment, this is something that you can easily try yourself. We preserve the original bagged estimators in the bagged_scores_ attribute. The term_scores_ attribute, which holds the values used for prediction and visualization, can be completely recalculated from bagged_scores_. If you were to take the median from bagged_scores_ and put that into term_scores_ it would visualize and also predict using that median. There's also a bagged_intercept_ and intercept_ that would need to be updated in the same fashion.

On the other insurance-based graphs, I have to admit I'm a little more confused than before if increasing validation works under the scenario of having a single enormous outlier. You would think that at least some of the bags would contain this single outlier, and they would have the big bump you showed in the original graph. I would expect it to be less prominent, but still present.

Regarding the greediness parameter, I think that using greediness is allowing the algorithm to put most of the weight of the single outlier into a single feature that is most predictive. Using something more cyclic and less greedy would instead spread the effect over all your features, which could be both good and bad. Perhaps if you looked at the feature values on the outlier's other features and then examined each feature graph at those points you might see bumps that were not as obvious beforehand.

It's interesting to think about how an insurance company might want to resolve these issues. Clearly, you don't want to quote future potential customers enormous sums if they just happen to have a feature value of 510 (my estimate of that bump's location), but removing such outliers makes the model blind to black swan events. This feels like an area that a company might want to use model editing for by asking experts which value ranges the outlier might be expected to occur in, and then as a post-process step spread that bump's weight throughout the feature range that the experts identify. A single sample isn't going to provide enough information on its own to do this, but conceivably an expert could.

Another parameter you might want to play with is the min_samples_leaf parameter. Requiring more samples per leaf could average down the bumps if set sufficiently high.

@StrudelDoodleS
Copy link

Hi Both,

I saw this and hoped to add some insight as I also work with insurance data. I have tried to take inspiration from traditional modelling software that insurers have used in the past (EMBLEM from WTW in my case).

What they tend to do is quite similar to what @eadains has suggested about post-processing. Where they will come up with bins for a feature arbitrarily, fit the model, post process with polynomial regression of some time even with some manual adjustment of a particular bin if necessary.

What I ended up doing is to sort of solve a similar issue that you have here (though my outliers were never as strong as this peak you have at ~510) is to use optuna with group-kfold to search on many of the params listed. However this will definitely not be feasible if your dataset is large, it took quite a few hours on a dataset of ~50k rows. I never tweaked my validation size however come to think of it... oops!

I also built some utility classes to smooth my scores using scipy.optimise.curve_fit and / or LOWESS looking towards splines in the future (though it seems a little ridiculous to add splines to the output; kind of like fitting a GAM to the output of this model haha).

What I also learned just a few days ago is that they re-fit the models with there post-processed curves frozen in the model. Example: you have 4 continuous features you adjust one of them manually you force the model to not re-weight it whilst fitting, not sure if that is possible in EBM's or not / if it is actually useful. I wonder what both of your thoughts are on this?

I would also highly recommend making a frequency and a severity model to see what drives a particular features score a bit better. This helped me understand why certain dips/spikes occur also.

Hopefully this wasn't to rambly haha.

@paulbkoch
Copy link
Collaborator

Thanks for the input @StrudelDoodleS. I like your idea of distilling an EBM into a traditional spline GAM and have wondered about this myself in the context of time series models where you may want to routinely predict something beyond the end of a feature's range in the training set. Traditional splines have advantages in this space where they might be able to reasonably extrapolate a polynomial for some distance past the end of the feature graph.

Freezing features is a currently supported methodology, although it isn't as clean or easy as I would like yet. In the future we'd like to support scikit-learn's warm_start functionality (see #403). Today, you can do this using a combination of init_score, exclude, and merge_ebms. I put an example of this in our documentation: https://interpret.ml/docs/python/examples/custom-interactions.html. The example notebook shows how to obtain higher dimensional interactions, but you can adapt it for freezing features since you need similar functionality where you build several models on top of each other. I should probably write a new example focused solely on freezing features.

@eadains
Copy link
Author

eadains commented Jun 19, 2024

Okay, I experimented with the median bagged approach, and it seems quite promising. I'm using that same French Auto claims dataset, but I've sampled 50,000 datapoints, using a seed that is cherry picked to pick up some of the more extreme outlier values, just for faster fit times.

Here's my model object definition:

model = ExplainableBoostingRegressor(
    feature_names=X.columns,
    feature_types=[
        "nominal",
        "continuous",
        "continuous",
        "continuous",
        "continuous",
        "nominal",
        "nominal",
        "continuous",
        "nominal",
    ],
    objective="tweedie_deviance:variance_power=1.7",
    interactions=0,
)
model.fit(X, y)

So all default except for 0 interactions for simplicity.

Here are two term graphs:
Screenshot 2024-06-19 at 2 26 44 PM
Screenshot 2024-06-19 at 2 27 01 PM
Where we can see the outlier problem clearly.

I used this code to change the term scores:

for idx, term in enumerate(model.bagged_scores_):
    median = np.median(term, axis=0)
    mad = np.median(np.abs(term - median), axis=0)
    model.term_scores_[idx] = median
    model.standard_deviations_[idx] = mad

model.intercept_ = np.median(model.bagged_intercept_)

I've also modified the standard deviations to instead be the Median Absolute Deviation.

This gives these term graphs for the same two features:
Screenshot 2024-06-19 at 2 30 13 PM
Screenshot 2024-06-19 at 2 30 29 PM
These look very nice to me. The "Density" variable graph definitely is more locally jagged, but this approach does a very nice job of smoothing over those outlier regions.

Increasing outer bags to 100 helps:
Screenshot 2024-06-19 at 2 36 17 PM

Setting outer bags back to the default and increasing smoothing rounds to 800 also helps:
Screenshot 2024-06-19 at 2 37 23 PM

This is with

outer_bags = 56
smoothing_rounds=800
Screenshot 2024-06-19 at 2 39 08 PM

My assessment here is that by using the median, the expected behavior of increasing outer bags and smoothing rounds is restored compared to what I was seeing before adjusting those parameters with the mean approach. The performance of this even using the default model options seems very promising to me, given how little extra computation this adds. Hope this is helpful!

@paulbkoch
Copy link
Collaborator

Very interesting @eadains! Out of curiosity, how did the median model perform vs the default mean based model?

@eadains
Copy link
Author

eadains commented Jun 19, 2024

Hacking together some cross-validation code:

splitter = KFold(n_splits=10)

model = ExplainableBoostingRegressor(
    feature_names=X.columns,
    feature_types=[
        "nominal",
        "continuous",
        "continuous",
        "continuous",
        "continuous",
        "nominal",
        "nominal",
        "continuous",
        "nominal",
    ],
    objective="tweedie_deviance:variance_power=1.7",
    interactions=0,
)

losses = []
for train_idx, test_idx in splitter.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    
    # ----- Median Adjustment -----
    for idx, term in enumerate(model.bagged_scores_):
        median = np.median(term, axis=0)
        mad = np.median(np.abs(term - median), axis=0)
        model.term_scores_[idx] = median
        model.standard_deviations_[idx] = mad

    model.intercept_ = np.median(model.bagged_intercept_)
    # -----------------------------

    model_out = model.predict(X[test_idx])
    loss = mean_tweedie_deviance(y[test_idx], model_out, power=1.7)
    losses.append(loss)

    print("iteration complete")

Mean score for the default mean approach: 49.924058183738566
Mean score for the median approach: 49.87414185077655
where lower is better for deviance. Both KFold and mean_tweedie_deviance are from Sklearn if that isn't obvious.

A minuscule difference in this case. Although, there is about a 5% drop in the standard deviation of the CV scores using the median adjustment, so perhaps slightly higher stability.

@StrudelDoodleS
Copy link

Thanks for the code snippet @eadains. I can also confirm from a data set at work that this is producing slightly better results when using the median for the frequency model using a Poisson deviance

I will try this for burn cost/pure premium with a tweedie objective to see if it makes a larger difference, burn cost and severity are noisier usually so this might be more beneficial.

@paulbkoch Thanks for the link & advice I will check it out. On the splines aspect / manual smoothing, this is both for business reasons and stakeholder demands. Giving them a features term scores that move around too much is harder to approve etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants