Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create feature names for non-pandas input, too #655

Closed
wants to merge 13 commits into from

Conversation

MartinStancsicsQC
Copy link
Contributor

@MartinStancsicsQC MartinStancsicsQC commented Jun 20, 2023

Checklist

  • Added a CHANGELOG.rst entry

Main changes

  • The feature_names_ attribute is populated for non-pandas input, too. It uses the X_{i} format for numerical columns, and (in the case of a SplitMatrix) the C_{i}__{cat} format for categorical columns. It would be a breaking change for users who relied on the feature_names_ attribute being unassigned in the case of non-pandas input, but I cannot imagine how that would be usefu (you never know, though).
  • A term_names_ attribute is also assigned during fit. It contains the name of the column in the original input data that corresponds to each column of the design matrix. In the case of numerical variables, it is the same as feature_names_. In the case of categorical columns, it differs in that it does not include the category at the end.

Reason

A separate PR for Wald-tests will soon follow, for which it will be useful to have a simple way to refer to columns by names. The same goes for what I call terms here. I am submitting this as a separate PR as it is relatively self-contained and it might be easier to review.

src/glum/_glm.py Outdated
return X, y, sample_weight, offset, weights_sum, P1, P2

def _get_feature_names(self, X: ArrayLike):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this method get a different name, please? I associated "get" with a low overhead getter (i.e. just return an attribute). This method doesn't return anything and it actually sets two fitted attributes.

I would think it's a bit cleaner if this method just returned the feature names and terms and leave setting them to _set_up_and_check_fit_args.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, totally agreed. Does this seem better?

@MartinStancsicsQC
Copy link
Contributor Author

The proposed tabmat formula interface has a categorical_format attribute that allows parametrizing categorical column names with formula strings (i.e. "{name}__{category}" or "{name}[{category}]". I think this feature it would be nice for glum's feature_names, too. Any thoughts?

@MatthiasSchmidtblaicherQC
Copy link
Contributor

Re. parametrizing categorical column names: I agree that this is a nice feature, but this should ideally be a separate PR.

Copy link
Member

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment is about the custom handling of different tabmat types. Ideally this is moved into tabmat. Happy to discuss.

"""
Get feature names for the input data.

Stores them in the ``feature_names_`` and ``column_names_`` attributes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Stores them in the ``feature_names_`` and ``column_names_`` attributes.
Stores them in the ``feature_names_`` and ``term_names_`` attributes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this function doesn't store them. We should revise the docstring.

feature_names.append(column)
term_names.append(column)

elif isinstance(X, tm.StandardizedMatrix):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to minimize the amount of custom logic depending on the tabmat type. Could you instead create a new method in tabmat to deal with this? I think it would actually be very useful metadata to store in the tabmat object itself.

@MartinStancsicsQC MartinStancsicsQC marked this pull request as draft July 19, 2023 06:59
@MartinStancsicsQC
Copy link
Contributor Author

It will probably be superseded by Tabmat PR 278. Closing it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants