[ENH] `GLMRegressor` using statsmodels `GLM` with gaussian link #222

julian-fong · 2024-03-25T00:09:22Z

Reference Issues/PRs

Implements a GLM model for the gaussian link in #7 , looking for feedback on the code changes

What does this implement/fix? Explain your changes.

The Gaussian Regressor is a direct interface to the statsmodel package's GLM model, and _fit, _predict, and _predict_proba have all been defined as noted in the regression extension template (have not started on the class method get_test_params)

Does your contribution introduce a new dependency? If yes, which one?

Yes, this contribution introduces a new dependency to the statsmodel package. A link to the package page can be found here: https://www.statsmodels.org/stable/index.html

What should a reviewer concentrate their feedback on?

This is just a rough draft (and there are probably loads of mistakes!) of the implementation. I am looking for feedback on the areas where I could improve/fix on code wise (on the constructor, fit, and predict methods), or even design wise.

Something I noticed in the GLM statsmodels package is that the fitted GLM model is returned as a separate object, as it is returned as part of a GLMResultsWrapper object.

Datasets X, y are to be defined in the constructor of the model, and not inside the fit function, so I opted to include the params of the GLM model's fit function inside the _fit function in Gaussian Regressor. (link to the fit function: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.fit.html#statsmodels.genmod.generalized_linear_model.GLM.fit)

I was considering whether or not to define the statsmodel GLM object inside the fit function instead of the constructor (similar to the MapieRegressor to follow the template to the letter, but later decided against. I am wondering if that is the recommended option to implement the GaussianRegressor as well.

Appreciate the feedback!

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the skpro root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
See here for full badge reference
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

fkiraly

Thanks, this is great!
Looks like we're almost done.

What is important is that you follow the "scikit-learn-like" interface precisely, and that means:

no data should be passed in __init__. The data is X and y. If this needs to be renamed or converged (e.g., to endog etc), do that in _fit or _predict
_fit should have the signature def _fit(self, X, y):, no additional arguments. If there are additional arguments, they need to be in __init__ and read from self.argname

Also see the regression.py extension template.

Minor things, but blocking (since it's code formatting):

ensure you follow code formatting guidelines, here's the guide on setting it up locally: https://www.sktime.net/en/stable/developer_guide/coding_standards.html
you accidentally added a newline in _sklearn.py, kindly revert that

Let us know if you need any help or have any questions about the above (e.g., on discord).

fkiraly · 2024-03-25T11:04:46Z

I was considering whether or not to define the statsmodel GLM object inside the fit function instead of the constructor (similar to the MapieRegressor to follow the template to the letter, but later decided against. I am wondering if that is the recommended option to implement the GaussianRegressor as well.

Yes, I think that's the right way to do it. Because GLM requires access to the data, and the earliest you see it is in _fit. So your initial intuition was the right one.

julian-fong · 2024-03-25T14:25:59Z

Hi @fkiraly, I've made the changes and added a couple more extra features (add_constant and get_test_params). Let me know if there are any more changes you would like me to implement.

julian-fong · 2024-03-25T14:36:00Z

One more thing, should I include the ability to add any family to be present i.e to include all the distribution families? I know in #7 it only said to implement the gaussian link

fkiraly · 2024-03-25T23:14:09Z

should I include the ability to add any family to be present i.e to include all the distribution families?

That would be great, though:

I'm not sure if all families have a corresponding distribution object already implemented
I'd suggest we make sure first this passes all check_estimator tests. Adding more families before that runs might complicate work on this, since it risks introducing a back/forth between parts of logic.

skpro/regression/linear/_glm.py

fkiraly

Great!

I left some comments above, most important is to get predict_proba right.

Also, you should test your model using skpro.utils.check_estimator, that utility allows you to quickly run relevant tests locally.

julian-fong · 2024-03-26T18:43:58Z

Hey @fkiraly,

I made some code updates, and I ran check_estimators from skpro.utils.estimator_checks locally and found one re-occuring error: NotImplementedError('no conversion defined from type pd_DataFrame_Table to pd_DataFrame'). I am wondering how I can fix this issue

fkiraly · 2024-03-30T09:25:16Z

Can you give me any insight as to what GLMRegressor-1 is and how it differs from GLMRegressor-0?

In get_test_params, you return a list of dict. This list, in the same order, is used to construct test instances, the first of which becomes GLMRegressor-0, the second GLMRegressr-1 (in python we start counting at 0).

These instances are then used in the tests from TestAllRegressors, substituted to the estimator_instance args.

So, if test_input_output_contract[GLMRegressor-1] fails, it means that estimator_instance = GLMRegressor(**params), where params is the second element in the list returned by get_test_params fails the test test_input_output_contract.

To get a full traceback for debugging, use raise_exceptions=True in check_estimator, you can also select individual tests by tests_to_run and fixtures_to_run.

For more details, see

https://www.sktime.net/en/latest/developer_guide/testing_framework.html

julian-fong · 2024-03-30T12:59:03Z

Thanks Franz! i believed i figured out what the source of the array mismatch was. In statsmodels theres an option to add a column of constants inside the GLM model if a user specifies it (the second parameter set specifies add_constant = True. In that case, it'll create an extra column called 'const' inside the pandas dataframe during training. When the tests call _predict_proba, since there is an extra column from _fit, it'll create an array mismatch because X_test does not account for that extra column

fkiraly · 2024-03-30T13:39:24Z

When the tests call _predict_proba, since there is an extra column from _fit, it'll create an array mismatch because X_test does not account for that extra column

That sounds like a bug in statsmodels, in get_predictions specifically, no?

Either way, if we know what the issue is, can we compensate for it by adding the column ourselves? (i assume it is an all-ones column)

We should also report the bug in statsmodels, if you can verify that it's indeed coming from get_predictions, and we should open a tracking issue that links to the statsmodels issue.

julian-fong · 2024-03-30T14:00:42Z

Yes, if the user specifies add_constant = True in initialization, we can compensate by adding it to the column ourselves if it does not exist already (this affects both the _predict and _predict_proba function). One thing to note is that in our docstring for both _predict and _predict_proba: its assumed that X_test has the same columns as X_train in fit

julian-fong · 2024-03-30T14:06:35Z

i did some testing with a toy example as follows:

import statsmodels.api as sm

# Create a sample dataset with two predictor variables (X1 and X2) and a response variable (y)
data = {
    'X1': [1, 2, 3, 4, 5],
    'X2': [2, 3, 4, 5, 6],
}

data2 = {
    'y': [3, 4, 5, 6, 7]
}
X = pd.DataFrame(data)
print(X)

# Add constant for intercept term
X = sm.add_constant(X)
print(X)
# Fit a GLM model

y = pd.DataFrame(data2)
model = sm.GLM(y, X, family=Gaussian())
result = model.fit()

# Print summary of the fitted model
# print(result.summary())

# Create a new observation for prediction
X_test = pd.DataFrame({'const': [1,1,1], 'X1': [6,2,4], 'X2': [7,2,5]})  # New observations with constant
X_test_ = pd.DataFrame({'X1': [6,2,4], 'X2': [7,2,5]}) #new observations without constant automatically set
# Predict with the model

print("-------predictions-------")
#will work
prediction = result.predict(X_test)
pred = result.get_prediction(X_test).summary_frame()
print(prediction)
print(pred)

#won't work
#prediction = result.predict(X_test_)
pred = result.get_prediction(X_test_).summary_frame()

# print("Prediction for new observation:", prediction)

y_test = pd.DataFrame(prediction, columns = y.columns)

A value error of ValueError: shapes (3,2) and (3,) not aligned: 2 (dim 1) != 3 (dim 0) is then returned

fkiraly · 2024-03-30T17:20:18Z

One thing to note is that in our docstring for both _predict and _predict_proba: its assumed that X_test has the same columns as X_train in fit

I mean, we do this only internally in _predict, to compensate for the bug in statsmodels until it is fixed.
That is, if add_constant=True, then make a copy of the df X where a column is added, and pass that (with the additional column) to statsmodels. The skpro user doesn't have to do that.

Does this make sense?

At the same time, I would (i.e., I'd suggest you do that) report the bug in statsmodels, with a reproducible example. If you want a review before that, you can post the bug report here first.

julian-fong · 2024-03-30T17:52:05Z

One thing to note is that in our docstring for both _predict and _predict_proba: its assumed that X_test has the same columns as X_train in fit

I mean, we do this only internally in _predict, to compensate for the bug in statsmodels until it is fixed. That is, if add_constant=True, then make a copy of the df X where a column is added, and pass that (with the additional column) to statsmodels. The skpro user doesn't have to do that.

Understood, but would we not want to leverage the same idea for _predict_proba since that method also requires adding a column 'const' if it is missing?

fkiraly · 2024-03-30T18:07:13Z

Understood, but would we not want to leverage the same idea for _predict_proba since that method also requires adding a column 'const' if it is missing?

Yes, of course - I would move the logic into a separate method, e.g., _prep_X, to avoid copy-pasting too much code.

…_constant is True

fkiraly · 2024-03-30T20:11:25Z

skpro/regression/linear/_glm.py

+        from statsmodels.genmod.generalized_linear_model import GLM
+        from statsmodels.tools import add_constant
+
+        if self.add_constant:


do we need to do this in _fit, too?

yes, i'll change the code so its consistent with the other methods

I thought statsmodels does it in fit?

As per my knowledge, the GLM model on statsmodels does not automatically include a constant column at any point during any fitting or prediction process

I see, so you added this feature?

Yes, I thought it was worth including as it was consistently used throughout all the tutorials and documentation I went over prior to writing the code

skpro/regression/linear/_glm.py

fkiraly · 2024-03-31T16:46:35Z

FYI, if you want to indicate that you're done with making changes and addressing the review comments, you can click here:

(the circling arrows)

Because, to me it looks like this is ready?

…ng/skpro into pr/222

fkiraly

Nice, no blocking comments.

Some comments which might still be worth addressing:

family can be picked by the user, but only Gaussian works at the moment, other choices will cause the estimator to break.
we should try to cover as many parameters as we can in get_test_params.
the docstring should say, for every parameter, what possible values are. E.g., what are possible values for cov_type, method, what are expected sizes for start_params, etc.

julian-fong · 2024-03-31T18:10:06Z

Sounds good! I'll keep my eye on this page and improve the code so that it is more robust when including other families. Thanks again for all the guidance!

fkiraly · 2024-03-31T18:37:28Z

let me know if you want to do it in this PR, or separately.

julian-fong · 2024-03-31T18:47:32Z

I'll make a seperate pr and include this pr so they are linked together (so this one doesn't get too congested with comments) once my GSoC proposal is submitted.

fkiraly · 2024-03-31T20:30:53Z

ok - I've removed the parameters that are broken or not useable. You can handle them later if you want.

@julian-fong

@julian-fong forgot to add their badges to `all-contributorsrc` for #222, this PR adds them. Also fixes an unrelated typo.

julian-fong added 5 commits March 23, 2024 15:14

added class name

b9a6343

updates

bebe835

wrote docstring

63a4684

implemented init, fit and predict

2e89bb2

implemented _predict_proba, fixed bugs

905386a

julian-fong marked this pull request as draft March 25, 2024 00:13

julian-fong changed the title ~~[ENH] Added Gaussian Regressor using statsmodels GLM with gaussian link~~ [ENH] - adding or improving code: Added Gaussian Regressor using statsmodels GLM with gaussian link Mar 25, 2024

julian-fong marked this pull request as ready for review March 25, 2024 00:13

fkiraly requested changes Mar 25, 2024

View reviewed changes

fkiraly added enhancement module:regression probabilistic regression module labels Mar 25, 2024

fkiraly added the interfacing algorithms Interfacing existing algorithms/estimators from third party packages label Mar 25, 2024

julian-fong added 3 commits March 25, 2024 09:47

made changes as per 222

f423be5

added parameter for intercept and implemented get_test_params

5700e29

minor change

cefb3db

julian-fong requested a review from fkiraly March 25, 2024 14:25

fkiraly reviewed Mar 25, 2024

View reviewed changes

skpro/regression/linear/_glm.py Outdated Show resolved Hide resolved

fkiraly reviewed Mar 25, 2024

View reviewed changes

skpro/regression/linear/_glm.py Outdated Show resolved Hide resolved

fkiraly reviewed Mar 25, 2024

View reviewed changes

skpro/regression/linear/_glm.py Outdated Show resolved Hide resolved

fkiraly reviewed Mar 25, 2024

View reviewed changes

skpro/regression/linear/_glm.py Show resolved Hide resolved

fkiraly reviewed Mar 25, 2024

View reviewed changes

skpro/regression/linear/_glm.py Outdated Show resolved Hide resolved

fkiraly requested changes Mar 25, 2024

View reviewed changes

julian-fong added 2 commits March 26, 2024 13:27

updated docstring, fixed issues

1ecda58

updated docstring, minor fixes

cac37ee

removed family parameter as we are only using the Gaussian link

a51cc5e

update test set params

52ea9bf

julian-fong added 2 commits March 30, 2024 15:49

wrote code to automatically append const column on inf methods if add…

8c2e1d4

…_constant is True

minor code change

05eb9d2

fkiraly reviewed Mar 30, 2024

View reviewed changes

skpro/regression/linear/_glm.py Outdated Show resolved Hide resolved

improved readability of code, moved if statement into _prep_x

d2e6dc6

fkiraly added 2 commits March 31, 2024 18:46

Merge branch 'main' into pr/222

813929e

Merge branch 'add_gaussian_regressor' of https://github.com/julian-fo…

b689a78

…ng/skpro into pr/222

fkiraly approved these changes Mar 31, 2024

View reviewed changes

fkiraly added 2 commits March 31, 2024 22:30

removed params

e6dff2c

Update _glm.py

0bd6f6f

fkiraly merged commit 91ca77a into sktime:main Mar 31, 2024
36 checks passed

fkiraly changed the title ~~[ENH] GaussianRegressor using statsmodels GLM with gaussian link~~ [ENH] GLMRegressor using statsmodels GLM with gaussian link Mar 31, 2024

fkiraly mentioned this pull request Mar 31, 2024

[ENH] further work on GLMRegressor interfacing statsmodels GLM #230

Closed

julian-fong deleted the add_gaussian_regressor branch April 2, 2024 19:49

fkiraly mentioned this pull request Apr 6, 2024

[DOC] add julian-fong to all-contributorsrc #238

Merged

fkiraly added a commit that referenced this pull request Apr 6, 2024

[DOC] add julian-fong to all-contributorsrc (#238)

bad4046

@julian-fong forgot to add their badges to `all-contributorsrc` for #222, this PR adds them. Also fixes an unrelated typo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] `GLMRegressor` using statsmodels `GLM` with gaussian link #222

[ENH] `GLMRegressor` using statsmodels `GLM` with gaussian link #222

julian-fong commented Mar 25, 2024

fkiraly left a comment •

edited

Loading

fkiraly commented Mar 25, 2024 •

edited

Loading

julian-fong commented Mar 25, 2024

julian-fong commented Mar 25, 2024

fkiraly commented Mar 25, 2024

fkiraly left a comment

julian-fong commented Mar 26, 2024

fkiraly commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024 •

edited

Loading

fkiraly commented Mar 30, 2024

julian-fong commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024 •

edited

Loading

fkiraly commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024

fkiraly commented Mar 30, 2024

fkiraly Mar 30, 2024

julian-fong Mar 30, 2024

fkiraly Mar 30, 2024

julian-fong Mar 30, 2024

fkiraly Mar 30, 2024

julian-fong Mar 30, 2024

fkiraly commented Mar 31, 2024

fkiraly left a comment

julian-fong commented Mar 31, 2024

fkiraly commented Mar 31, 2024

julian-fong commented Mar 31, 2024

fkiraly commented Mar 31, 2024

[ENH] GLMRegressor using statsmodels GLM with gaussian link #222

[ENH] GLMRegressor using statsmodels GLM with gaussian link #222

Conversation

julian-fong commented Mar 25, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

PR checklist

For all contributions

fkiraly left a comment • edited Loading

Choose a reason for hiding this comment

fkiraly commented Mar 25, 2024 • edited Loading

julian-fong commented Mar 25, 2024

julian-fong commented Mar 25, 2024

fkiraly commented Mar 25, 2024

fkiraly left a comment

Choose a reason for hiding this comment

julian-fong commented Mar 26, 2024

fkiraly commented Mar 30, 2024 • edited Loading

julian-fong commented Mar 30, 2024 • edited Loading

fkiraly commented Mar 30, 2024

julian-fong commented Mar 30, 2024 • edited Loading

julian-fong commented Mar 30, 2024 • edited Loading

fkiraly commented Mar 30, 2024 • edited Loading

julian-fong commented Mar 30, 2024

fkiraly commented Mar 30, 2024

fkiraly Mar 30, 2024

Choose a reason for hiding this comment

julian-fong Mar 30, 2024

Choose a reason for hiding this comment

fkiraly Mar 30, 2024

Choose a reason for hiding this comment

julian-fong Mar 30, 2024

Choose a reason for hiding this comment

fkiraly Mar 30, 2024

Choose a reason for hiding this comment

julian-fong Mar 30, 2024

Choose a reason for hiding this comment

fkiraly commented Mar 31, 2024

fkiraly left a comment

Choose a reason for hiding this comment

julian-fong commented Mar 31, 2024

fkiraly commented Mar 31, 2024

julian-fong commented Mar 31, 2024

fkiraly commented Mar 31, 2024

[ENH] `GLMRegressor` using statsmodels `GLM` with gaussian link #222

[ENH] `GLMRegressor` using statsmodels `GLM` with gaussian link #222

fkiraly left a comment •

edited

Loading

fkiraly commented Mar 25, 2024 •

edited

Loading

fkiraly commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024 •

edited

Loading

julian-fong commented Mar 30, 2024 •

edited

Loading

fkiraly commented Mar 30, 2024 •

edited

Loading