Create feature names for non-pandas input, too #655

MartinStancsicsQC · 2023-06-20T15:18:29Z

Checklist

Added a CHANGELOG.rst entry

Main changes

The feature_names_ attribute is populated for non-pandas input, too. It uses the X_{i} format for numerical columns, and (in the case of a SplitMatrix) the C_{i}__{cat} format for categorical columns. This is a breaking change for users who relied on the feature_names_ attribute being unassigned in the case of non-pandas input, but I cannot imagine how that would be useful (you never know, though).
A term_names_ attribute is also assigned during fit. It contains the name of the column in the original input data that corresponds to each column of the design matrix. In the case of numerical variables, it is the same as feature_names_. In the case of categorical columns, it differs in that it does not include the category at the end.
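To make the naming scheme above concrete, here is a minimal sketch in plain Python. It is not glum's implementation: the helper `sketch_feature_and_term_names` is hypothetical, and it assumes that `i` is the zero-based position of the column in the input, which the description does not spell out.

```python
from typing import List, Optional, Sequence, Tuple


def sketch_feature_and_term_names(
    columns: Sequence[Optional[Sequence[str]]],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper, not part of glum.

    Each entry of ``columns`` describes one input column: ``None`` for a
    numerical column, or the list of categories for a categorical one.
    Assumption: ``i`` is the zero-based position of the input column.
    """
    feature_names: List[str] = []
    term_names: List[str] = []
    for i, categories in enumerate(columns):
        if categories is None:
            # Numerical column: one design-matrix column, same name in both lists.
            feature_names.append(f"X_{i}")
            term_names.append(f"X_{i}")
        else:
            # Categorical column: one design-matrix column per category.
            # The term name drops the category suffix.
            for cat in categories:
                feature_names.append(f"C_{i}__{cat}")
                term_names.append(f"C_{i}")
    return feature_names, term_names


feature_names, term_names = sketch_feature_and_term_names([None, ["a", "b"], None])
print(feature_names)  # ['X_0', 'C_1__a', 'C_1__b', 'X_2']
print(term_names)     # ['X_0', 'C_1', 'C_1', 'X_2']
```

Both lists have one entry per design-matrix column, so a term name can be used to select all columns that belong to the same original variable.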

Reason

A separate PR for Wald-tests will soon follow, for which it will be useful to have a simple way to refer to columns by names. The same goes for what I call terms here. I am submitting this as a separate PR as it is relatively self-contained and it might be easier to review.

jtilly · 2023-06-22T08:23:43Z

src/glum/_glm.py

        return X, y, sample_weight, offset, weights_sum, P1, P2

+    def _get_feature_names(self, X: ArrayLike):


Could this method get a different name, please? I associate "get" with a low-overhead getter (i.e. one that just returns an attribute). This method doesn't return anything, and it actually sets two fitted attributes.

I think it would be a bit cleaner if this method just returned the feature names and terms and left setting them to _set_up_and_check_fit_args.
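The separation suggested here could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the code in this PR: `SketchRegressor` is a hypothetical stand-in class, the real `_set_up_and_check_fit_args` takes more arguments and returns a tuple, and only numerical columns are handled.

```python
from typing import List, Sequence, Tuple


class SketchRegressor:
    """Hypothetical stand-in for the estimator; only the naming logic is shown."""

    def _get_feature_names(
        self, X: Sequence[Sequence[float]]
    ) -> Tuple[List[str], List[str]]:
        # Pure helper: computes the names and returns them, without touching self.
        n_columns = len(X[0]) if len(X) > 0 else 0
        names = [f"X_{i}" for i in range(n_columns)]
        # All columns are numerical in this sketch, so terms equal features.
        return names, list(names)

    def _set_up_and_check_fit_args(self, X: Sequence[Sequence[float]]) -> None:
        # The caller is the only place where the fitted attributes are assigned.
        self.feature_names_, self.term_names_ = self._get_feature_names(X)


model = SketchRegressor()
model._set_up_and_check_fit_args([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(model.feature_names_)  # ['X_0', 'X_1', 'X_2']
```

With this split, the helper can be tested without fitting anything, and all fitted attributes are assigned in one place.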

MartinStancsicsQC · 2023-06-23

Thanks, totally agreed. Does this seem better?

MartinStancsicsQC · 2023-06-30T13:58:13Z

The proposed tabmat formula interface has a categorical_format attribute that allows parametrizing categorical column names with format strings (e.g. "{name}__{category}" or "{name}[{category}]"). I think this feature would be nice for glum's feature_names, too. Any thoughts?
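As a rough illustration of the idea, such a format string can be applied with Python's built-in `str.format`. The helper `format_categorical_names` below is hypothetical and only mirrors the two example formats from the comment; it is not tabmat's or glum's API.

```python
from typing import List, Sequence


def format_categorical_names(
    name: str,
    categories: Sequence[str],
    categorical_format: str = "{name}__{category}",
) -> List[str]:
    """Hypothetical helper: one feature name per category of a column."""
    # str.format fills the {name} and {category} placeholders by keyword.
    return [
        categorical_format.format(name=name, category=category)
        for category in categories
    ]


print(format_categorical_names("C_0", ["a", "b"]))
# ['C_0__a', 'C_0__b']
print(format_categorical_names("C_0", ["a", "b"], "{name}[{category}]"))
# ['C_0[a]', 'C_0[b]']
```

Keeping the default at "{name}__{category}" would leave the names described in this PR unchanged for users who do not set the option.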

MatthiasSchmidtblaicherQC · 2023-07-03T06:44:27Z

Re. parametrizing categorical column names: I agree that this is a nice feature, but this should ideally be a separate PR.

MarcAntoineSchmidtQC

My main comment is about the custom handling of different tabmat types. Ideally this is moved into tabmat. Happy to discuss.

MarcAntoineSchmidtQC · 2023-07-07T19:28:24Z

src/glum/_glm.py

+        """
+        Get feature names for the input data.
+
+        Stores them in the ``feature_names_`` and ``column_names_`` attributes.


Suggested change

-        Stores them in the ``feature_names_`` and ``column_names_`` attributes.
+        Stores them in the ``feature_names_`` and ``term_names_`` attributes.

Actually, this function doesn't store them. We should revise the docstring.

MarcAntoineSchmidtQC · 2023-07-07T19:32:29Z

src/glum/_glm.py

+                    feature_names.append(column)
+                    term_names.append(column)
+
+        elif isinstance(X, tm.StandardizedMatrix):


We should try to minimize the amount of custom logic depending on the tabmat type. Could you instead create a new method in tabmat to deal with this? I think it would actually be very useful metadata to store in the tabmat object itself.

MartinStancsicsQC · 2023-07-19T07:03:19Z

It will probably be superseded by Tabmat PR 278. Closing it for now.

stanmart added 10 commits June 20, 2023 15:39

Add feature names for non-pandas inputs, too

f9dc543

Move new method to GeneralizedLinearRegressorBase

b83b286

Adjust and extend tests

b61d82a

Move _get_feature_names to after input validation

5ee9764

in the non-pandas case

Do not raise on one singleton categories

9cc86a9

Rename attribute

85c4df0

Add more tests

2f52b2a

Add changelog entry

fddb640

Raise on singleton categorical columns

d94c136

I still think it should be a warning, but first some downstream changes are in order in tabmat so that it can handle this case gracefully.

Add unreachable case to _get_feature_names

2600331

MartinStancsicsQC requested review from tbenthompson, MarcAntoineSchmidtQC, xhochy, jtilly and lbittarello as code owners June 20, 2023 15:18

MatthiasSchmidtblaicherQC reviewed Jun 21, 2023

CHANGELOG.rst (outdated, resolved)

MatthiasSchmidtblaicherQC reviewed Jun 21, 2023

src/glum/_glm.py (outdated, resolved)

jtilly reviewed Jun 22, 2023

stanmart added 2 commits June 23, 2023 16:00

Return result instead of mutating object

fee4eec

Correct changelog [skip ci]

4d8a57f

Make categorical variable format customizable

e7584d5

MarcAntoineSchmidtQC reviewed Jul 7, 2023

MartinStancsicsQC marked this pull request as draft July 19, 2023 06:59

MartinStancsicsQC closed this Jul 19, 2023

MartinStancsicsQC mentioned this pull request Aug 2, 2023

Add methods for performing Wald-tests #668

Merged

1 task
