-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smoothness over variable regions with outlier outcome values #548
Comments
Hi @eadains -- The magic parameter you seek is called "smoothing_rounds". We have an example notebook dedicated to smoothing rounds at: https://interpret.ml/docs/python/examples/interpretable-regression-synthetic.html Changing the validation size seems to work nicely here. I haven't tried such a large percentage myself, and I'm a bit surprised it's that effective. Seeing it here makes me think that perhaps the current default of 0.15 is creating outer bags that are too homogeneous. If the full dataset happens to have outliers in some section of the graph, then the majority of the outer bagged models might agree on an outlier prediction in that section. Allowing the outer bags to diverge more might combat this effect. Another parameter that can sometimes help with smoothness is the "greediness" parameter, or "greedy_ratio" in more recent versions. The original cyclic boosting algorithm that we initially released with can sometimes overfit individual features when you have a scenario where some features require a lot more boosting than others. Mostly this occurs when you have a dataset that mixes nominal and continuous features. To avoid this, the current algorithm has been changed to a hybrid between cyclic boosting and traditional greedy boosting. I've been calling it semi-greedy boosting. The "greediness" parameter has been around for a while, but it became the default in v0.6.0, so depending on what version you are running you might want to try turning it on manually. Some of your graphs appear to be monotonic, at least in some parts. Applying monotonic constraints can sometimes improve smoothness. We support both training time monotonicity (monotone_constraints parameter) and post-processed monotonicity via the "monotonize" function. As a last resort, you can always edit the model after fitting. GAMChanger (https://github.com/interpretml/gam-changer) is probably the best solution I can recommend there, but I think GAMChanger hasn't been modified yet to handle alternative loss functions, so someone needs to update it for that. Another alternative is to edit the model directly using the ebm.term_scores_ attribute. If you're not using the latest version (0.6.1) of interpret, I would also recommend upgrading. The 0.6.0 release was mostly focused on improving the accuracy of EBMs, but there were some other changes that might lead to smoother models, especially for scenarios involving alternative loss functions. Some other ideas: You could adjust your bin cuts so that all the outlier samples are placed into a single bin that includes additional samples from either side. Bin adjustment is done through the "feature_types" attribute which is documented at: https://interpret.ml/docs/python/api/ExplainableBoostingRegressor.html#explainableboostingregressor @richcaruana might have more ideas. |
If any of these things work for you @eadains, please let us know. This is an area we'd like to get more feedback on, and could lead us to changing the default parameters if we can find something consistent to improve. |
Paul, Thank you for making the effort to look at this! I've been look at a different insurance dataset, available here if you want to fiddle with it yourself, and I have some more observations. The target variable is Pure Premium, which is total loss over exposure. It seems to me that the real problem here is with outliers generally. I tried out the smoothing rounds parameter: model = ExplainableBoostingRegressor(
feature_names=X.columns,
feature_types=[
"nominal",
"continuous",
"continuous",
"continuous",
"continuous",
"nominal",
"nominal",
"continuous",
"nominal",
],
objective="tweedie_deviance:variance_power=1.7",
smoothing_rounds=4000,
greedy_ratio=0
) and while this does makes things, well, smoother, it also seems to "pin" the resulting curves around the outlier points. You can see it here, where I set the smoothing rounds to 200 (the default), 2000, and 4000: Increasing outer bags and the validation size parameter still seems to do pretty well: model = ExplainableBoostingRegressor(
feature_names=X.columns,
feature_types=[
"nominal",
"continuous",
"continuous",
"continuous",
"continuous",
"nominal",
"nominal",
"continuous",
"nominal",
],
objective="tweedie_deviance:variance_power=1.7",
validation_size=0.50,
outer_bags=100
)
Given that increasing outer bags and smoothing rounds together seems to dramatically increase fitting time, I've found a nice balance by setting only: validation_size = 0.50
smoothing_rounds = 2000 and leaving outer bags at its default. The same curve, again: I hope my waffling may be useful. One thing I can say is that increasing the validation size does seem to improve smoothness regardless of what other hyperparameters are set. I think your explanation of the effect makes sense. It may be especially relevant in datasets with significant outliers, because increasing validation size means that fewer of the bagged datasets will contain those outliers, giving a more well-behaved average prediction. Smoothing rounds also seems to help with this. If I have some extra time, I may play around with generating some synthetic data. I've found that insurance datasets are particularly strange because the outcome is mostly zeros with a few very large values. I've also been thinking that it may be possible to do some post-processing of the curves using a lasso-penalized regression, like trend filtering, to smooth the curves, but we'll see if I ever get around to implementing that! |
Just another thought: I assume that right now we're taking the mean of the bagged estimators. Would it make any sense to implement an option to take the median instead? |
Hi @eadains -- This is all very fascinating. I really like your idea of offering a median bagged estimator option. If you're in the mood to experiment, this is something that you can easily try yourself. We preserve the original bagged estimators in the bagged_scores_ attribute. The term_scores_ attribute, which holds the values used for prediction and visualization, can be completely recalculated from bagged_scores_. If you were to take the median from bagged_scores_ and put that into term_scores_ it would visualize and also predict using that median. There's also a bagged_intercept_ and intercept_ that would need to be updated in the same fashion. On the other insurance-based graphs, I have to admit I'm a little more confused than before if increasing validation works under the scenario of having a single enormous outlier. You would think that at least some of the bags would contain this single outlier, and they would have the big bump you showed in the original graph. I would expect it to be less prominent, but still present. Regarding the greediness parameter, I think that using greediness is allowing the algorithm to put most of the weight of the single outlier into a single feature that is most predictive. Using something more cyclic and less greedy would instead spread the effect over all your features, which could be both good and bad. Perhaps if you looked at the feature values on the outlier's other features and then examined each feature graph at those points you might see bumps that were not as obvious beforehand. It's interesting to think about how an insurance company might want to resolve these issues. Clearly, you don't want to quote future potential customers enormous sums if they just happen to have a feature value of 510 (my estimate of that bump's location), but removing such outliers makes the model blind to black swan events. This feels like an area that a company might want to use model editing for by asking experts which value ranges the outlier might be expected to occur in, and then as a post-process step spread that bump's weight throughout the feature range that the experts identify. A single sample isn't going to provide enough information on its own to do this, but conceivably an expert could. Another parameter you might want to play with is the min_samples_leaf parameter. Requiring more samples per leaf could average down the bumps if set sufficiently high. |
Hi Both, I saw this and hoped to add some insight as I also work with insurance data. I have tried to take inspiration from traditional modelling software that insurers have used in the past (EMBLEM from WTW in my case). What they tend to do is quite similar to what @eadains has suggested about post-processing. Where they will come up with bins for a feature arbitrarily, fit the model, post process with polynomial regression of some time even with some manual adjustment of a particular bin if necessary. What I ended up doing is to sort of solve a similar issue that you have here (though my outliers were never as strong as this peak you have at ~510) is to use optuna with group-kfold to search on many of the params listed. However this will definitely not be feasible if your dataset is large, it took quite a few hours on a dataset of ~50k rows. I never tweaked my validation size however come to think of it... oops! I also built some utility classes to smooth my scores using scipy.optimise.curve_fit and / or LOWESS looking towards splines in the future (though it seems a little ridiculous to add splines to the output; kind of like fitting a GAM to the output of this model haha). What I also learned just a few days ago is that they re-fit the models with there post-processed curves frozen in the model. Example: you have 4 continuous features you adjust one of them manually you force the model to not re-weight it whilst fitting, not sure if that is possible in EBM's or not / if it is actually useful. I wonder what both of your thoughts are on this? I would also highly recommend making a frequency and a severity model to see what drives a particular features score a bit better. This helped me understand why certain dips/spikes occur also. Hopefully this wasn't to rambly haha. |
Thanks for the input @StrudelDoodleS. I like your idea of distilling an EBM into a traditional spline GAM and have wondered about this myself in the context of time series models where you may want to routinely predict something beyond the end of a feature's range in the training set. Traditional splines have advantages in this space where they might be able to reasonably extrapolate a polynomial for some distance past the end of the feature graph. Freezing features is a currently supported methodology, although it isn't as clean or easy as I would like yet. In the future we'd like to support scikit-learn's warm_start functionality (see #403). Today, you can do this using a combination of init_score, exclude, and merge_ebms. I put an example of this in our documentation: https://interpret.ml/docs/python/examples/custom-interactions.html. The example notebook shows how to obtain higher dimensional interactions, but you can adapt it for freezing features since you need similar functionality where you build several models on top of each other. I should probably write a new example focused solely on freezing features. |
Okay, I experimented with the median bagged approach, and it seems quite promising. I'm using that same French Auto claims dataset, but I've sampled 50,000 datapoints, using a seed that is cherry picked to pick up some of the more extreme outlier values, just for faster fit times. Here's my model object definition: model = ExplainableBoostingRegressor(
feature_names=X.columns,
feature_types=[
"nominal",
"continuous",
"continuous",
"continuous",
"continuous",
"nominal",
"nominal",
"continuous",
"nominal",
],
objective="tweedie_deviance:variance_power=1.7",
interactions=0,
)
model.fit(X, y) So all default except for 0 interactions for simplicity. Here are two term graphs: I used this code to change the term scores: for idx, term in enumerate(model.bagged_scores_):
median = np.median(term, axis=0)
mad = np.median(np.abs(term - median), axis=0)
model.term_scores_[idx] = median
model.standard_deviations_[idx] = mad
model.intercept_ = np.median(model.bagged_intercept_) I've also modified the standard deviations to instead be the Median Absolute Deviation. This gives these term graphs for the same two features: Increasing outer bags to 100 helps: Setting outer bags back to the default and increasing smoothing rounds to 800 also helps: This is with outer_bags = 56
smoothing_rounds=800 My assessment here is that by using the median, the expected behavior of increasing outer bags and smoothing rounds is restored compared to what I was seeing before adjusting those parameters with the mean approach. The performance of this even using the default model options seems very promising to me, given how little extra computation this adds. Hope this is helpful! |
Very interesting @eadains! Out of curiosity, how did the median model perform vs the default mean based model? |
Hacking together some cross-validation code: splitter = KFold(n_splits=10)
model = ExplainableBoostingRegressor(
feature_names=X.columns,
feature_types=[
"nominal",
"continuous",
"continuous",
"continuous",
"continuous",
"nominal",
"nominal",
"continuous",
"nominal",
],
objective="tweedie_deviance:variance_power=1.7",
interactions=0,
)
losses = []
for train_idx, test_idx in splitter.split(X, y):
model.fit(X[train_idx], y[train_idx])
# ----- Median Adjustment -----
for idx, term in enumerate(model.bagged_scores_):
median = np.median(term, axis=0)
mad = np.median(np.abs(term - median), axis=0)
model.term_scores_[idx] = median
model.standard_deviations_[idx] = mad
model.intercept_ = np.median(model.bagged_intercept_)
# -----------------------------
model_out = model.predict(X[test_idx])
loss = mean_tweedie_deviance(y[test_idx], model_out, power=1.7)
losses.append(loss)
print("iteration complete") Mean score for the default mean approach: A minuscule difference in this case. Although, there is about a 5% drop in the standard deviation of the CV scores using the median adjustment, so perhaps slightly higher stability. |
Thanks for the code snippet @eadains. I can also confirm from a data set at work that this is producing slightly better results when using the median for the frequency model using a Poisson deviance I will try this for burn cost/pure premium with a tweedie objective to see if it makes a larger difference, burn cost and severity are noisier usually so this might be more beneficial. @paulbkoch Thanks for the link & advice I will check it out. On the splines aspect / manual smoothing, this is both for business reasons and stakeholder demands. Giving them a features term scores that move around too much is harder to approve etc. |
I'm training an EBM regressor on insurance loss data where the outcome has many zero values and large right skew. This is my model configuration:
The curve for one of the continuous predictors looks like this:
where there is that dip between values 5 and 6. In this case, that region of the predictor space happens to contain mostly small outcome values. Here's another predictor:
where there is a large jump up for a small region for small values of the parameter on the left side. Here, this region happens to contain a few large predictor values.
Essentially, there are some regions of the predictor space that contain some outliers in the outcome variable compared to other nearby regions. I'm wondering if there is a good way to "smooth out" these regions. I have tried increasing the validation size to 0.50, and that does seem to help:
but it also seems to dampen the responsiveness across the board. I'm not entirely sure, from an algorithmic perspective, how changing the validation size would affect this, so any thoughts there would also be helpful.
The text was updated successfully, but these errors were encountered: