Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

categorical support design proposal #36

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
143 changes: 143 additions & 0 deletions steps/24_categorical_support/step.md
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my understanding, this is the case:

  • design 1 requires users to specifies which features are categorical.
  • design 3 requires users to encode externally.

So, a general question: is design 2 the only design to make use of feature kind PR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have clarified about design 3 in #36 (comment). Please check it out.

Original file line number Diff line number Diff line change
@@ -0,0 +1,143 @@
# Adding categorical support to sktime

Contributors: @Abhay-Lejith

## Introduction

### Objectives
Allow users to pass data with categorical features to any estimator. If not supported, return an informative error message. Otherwise it should run without errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This objective already assumes a design. "if not supported, ..."

Try to separate objective from solutioning, please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't notice, will change it.


### Current status:
* Non-numeric dtypes are completely blocked off in datatype checks.
* https://github.com/sktime/sktime/pull/5886 - This pr removes these checks but has not been merged yet as most estimators do not support categorical and will break without informative error messages.
* https://github.com/sktime/sktime/pull/6490 - pr adding feature_kind metadata

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest, start with outlining a conceptual model. Explain what kind of (abstract) estimators there are, and what you want to model. Then proceed to example code that should run, try to cover different cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this 'conceptual model' heavily depend on which design we choose to go ahead with. Or do you want me to do this for all the proposed designs?

Explain what kind of (abstract) estimators there are, and what you want to model

I don't think there are any new estimators being made, changes will have to be made to existing classes and again, the change will depend on what design we go for.


## Description of proposed designs


For each design we will have to discuss 8 combinations of cases:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try to give more context to the reader, what is the context/setting? no.2 in particular may be unclear without

1. estimator natively supports categorical: yes/no
2. categorical feature specified by user: yes/no
3. categorical present in data: yes/no

## Specifying categorical features:
- The reason I proposed this is in case the user wants to treat a column with numerical values as categorical. This will not be detected by feature_kind as it relies only on the dtype.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"i proposed this" - missing context. Try to separate solutions from requirements, and ensure the document contains context before you reference the context.

- More on how this is tackled in each design below.
---
---


### Design 1. Allow categorical features ONLY in those estimators which support natively. (no encoding internally)

- `categorical_features` parameter will be present only in estimators that support natively.
- For estimators that do not support natively, users can make a pipeline with an encoder of their choice and use categorical variables. This is the general ml workflow, like how it's done in sklearn too.

#### How tag is going to be used
- in datatype checks :
```
if categorical support tag is false:
if categorical feature is present:
error - cat feature is not supported by this estimator
```

- In this case `categorical feature` will not be a parameter in models that do not support. So (NO,YES,_) is not possible.

| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, proceeds as usual without error. |
| NO | NO | YES| informative error message to user that categorical feature is not supported by the estimator |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts: should it warn and then work by dropping specified column(s)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a different design, there are multiple options - encode, drop, or fail

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned as "thoughts" to suggest to @Abhay-Lejith to consider this as another possibility (albeit not very desirable) too to do pros and cons. The intention seems to be miscommunicated.

| NO | YES| NO | X |
Abhay-Lejith marked this conversation as resolved.
Show resolved Hide resolved
| NO | YES| YES| X |
Abhay-Lejith marked this conversation as resolved.
Show resolved Hide resolved
| YES| NO | NO | normal case, proceeds as usual without error. |
| YES| NO | YES| use categorical columns as detected by feature_kind and proceed. |
| YES| YES| NO | informative error message that specified column was not present in data.|
| YES| YES| YES| use categorical columns as detected by feature_kind along with the user specified columns(union) and proceed. |

- in case where specified column was not found, we can also choose to ignore and continue with warning that column was not found.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: what do you mean by "continue" here?

For example, let's consider ARIMA for revenue forecasting and we have categorical feature of user rating. What will happen?

Copy link
Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarnabrina This is design 1 which allows categorical only in estimators which support natively. ARIMA does not support natively, so it would just fail saying categorical not supported.
Also, in the other comments, could you also mention which line number you are referring to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, if we have a different approach, I would suggest you record it and try to structure the different choices one can make here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly , the document has three major headings that say Design 1, Design 2 and Design 3. Is it really not clear?


#### estimators that can be added:
- xgboost, lightgbm and catboost are some models that have native support for categorical. But method of passing categorical features varies quite a lot between them (I've commented on this in umbrella issue - https://github.com/sktime/sktime/issues/6109). Making a general interface might be challenging.
- So it might be better to interface them in separate classes inheriting from reduction forecaster class. (similar to darts lightgbm, see - https://unit8co.github.io/darts/generated_api/darts.models.forecasting.lgbm.html)

#### PROS:
- No internal encoding, comparatively easier to implement.
- No major changes to core logic.
- Follows general ML workflow i.e if estimator does not support, then user must preprocess their data(including encoding) and then use it.
#### CONS:
- Less number of estimators support categorical natively.


---
---

### Design 2. Perform one hot encoding if not natively supported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notes: definitely can be default choice but should be configurable per column and/or overall.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really required though? Encoders exist for this very purpose and are available to users. Why do we have to provide all that functionality inbuilt? In sklearn, there is nothing of this sort. Users can just preprocess the data and then use it in estimators.

Copy link
Member

@yarnabrina yarnabrina Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, it is really required.

  • one-hot encoder is not applibale or suitable always.
    • variables with a lot of levels may create too many features, which can cause issues.
    • ordinal variables will loose complete order.
  • if encoded outside, there will not be any possibility to use native support from the estimator when available in a pipeline step, and use encoding otherwise etc. I asked the pipeline question very much for this reason.
  • since one hot encoder change the dimension of the data, it can actually be not supported in some cases or be complicated due to the collinearity it usually adds (unless you drop one column).

I don't think we should not do a direct comparison with sklearn in planning everything about sktime. They are not the same package, and the capabilities they provide are also not at same level mostly (sklearn implements all their algorithms, in most places sktime framework gives unified format from different underlying packages). This is very likely opinionated, and @fkiraly and other core developers may not have the same view as well. But giving user the choice should be preferred over fixing something in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course. I agree that providing all options available is better than a single fixed encoding option.

I meant, is the internal encoding necessary . And why?

Copy link
Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But giving user the choice should be preferred over fixing something in my opinion.

Precisely, and they have full choice on what columns to encode, what encoder to use and they can do it externally. I don't quite get why we want to make such major changes to sktime only to provide what can already be done.

Trying to cram such a large functionality into an inbuilt feature will prove to be very difficult. We would need to allow user to specify which columns to encode, which encoders to use, which encoder to use on which column etc.

On the other hand, if we fix a default encoding type, then in most cases the user might have to resort to external encoding anyway. So the internal encoding is not very useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it's a complex change, there is no doubt about that at all. But that can't be the reason to not even consider in planning design phase, complexity of the task is why it was considered for full time summer project instead of open source contributors adding parts of it based on their availability.

if encoded outside, there will not be any possibility to use native support from the estimator when available in a pipeline step, and use encoding otherwise etc. I asked the pipeline question very much for this reason.

I think you did not consider this case, please correct me if I am wrong. I'll give a possible flow below, this is similar to @benHeid 's advanced non-linear pipeline.

Suppose a forecasting pipeline has 3 steps: first it predicts few of the future-unknown X variables using other future-known X variables, then their union is used to predict y. The forecasters used in the two steps need not be same, and one may have native support for categorical features and other may not. In that case, it's perfectly reasonable for user to want to take benefit of native support when available and use encoding when not.

I am visualising that in this case we need to encode only for the forecaster that does not have native support, not for the other one. If there are more estimators in the pipeline where same variable(s) need to be encoded, reuse same trained encoder there too instead of training again and again. This does not have to be the only way, and there can be better methods, but as of now I am not sure if this can be handled if the encoding of categorical features happen outside of sktime pipeline by users directly.


- `categorical_feature` param will be present in all models.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts: if this is available as a configuration instead of as an argument, will that be feasible? If so, will it help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Abhay-Lejith @fkiraly any thoughts about this comment?


#### How tag is going to be used:
```
if cat support tag is false:
encode cat features and continue
else:
continue
```

| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, proceeds as usual without error. |
| NO | NO | YES| encode features internally as detected by feature_kind and continue.
| NO | YES| NO | error that specified column was not present in data |
| NO | YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed to encode |
| YES| NO | NO | normal case, proceeds without error.|
| YES| NO | YES| use categorical features detected by feature_kind and proceed|
| YES| YES| NO | error that specified column was not present in data|
| YES| YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed|

#### Estimators that will be affected:
- all estimators that take exog input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarification: are we limiting the scope of categorical support to only forecasters and to only X?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I think we can start with just X. For y, there will be complications I feel. One major one I can think of immediately is univariate becoming multivariate on one-hot encoding.
We can of course extend to y later I suppose.

Copy link
Contributor

@fkiraly fkiraly Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One major one I can think of immediately is univariate becoming multivariate on one-hot encoding.

I think you making leaps of argumentation here, @Abhay-Lejith - you keep presupposing the "one hot encoding" idea, and silently assuming that is the only way to deal with categorical data.

What I mean is, when actively asked you are certainly aware that this is not the case, but your "intuitive thought processes" still seem to assume it, as the above shows for me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly I just brought up a certain case where if the user were to one-hot encode y, it could cause issues. I never said it is the only case. Is anything wrong in this?


#### PROS:
- Most estimators will be able to accept categorical inputs.
- If user only wants to one-hot encode, this design is easier as user does not have to do anything extra.(encoding is done internally)

#### CONS:
- Will be more complex to implement, may require significant changes to core logic (for internal encoding).
- adding `cat_features` parameter to all estimators is not desired
- Not flexible in choice of encoding (label,one-hot,etc). If user wants different types, will have to resort to externallly doing it anyway.

---
---

### Design 3. Same as design 2 but NO `cat_features` parameter in model
Instead, user is required to convert the required numerical features to categorical. To ease the process, we can make a `transformer` to make this conversion.

- no `cat_features` parameter in models.

| **Estimator supports categorical natively** | **Cat feature present in data** | **Outcome** |
|:-------------------------------------------:|:-------------------------------:|:---------------:|
| NO | NO | normal case, proceeds as usual without error. |
| NO | YES| encode features internally as detected by feature_kind and continue.
| YES| NO | normal case, proceeds without error.|
| YES| YES| use categorical features detected by feature_kind and proceed|

#### PROS:
- Most estimators will be able to accept categorical inputs.
- If user only wants to one-hot encode, this design is easier as user does not have to do anything extra.(encoding is done internally)
- Removes need for adding extra parameter to all models.

#### CONS:
- Will be more complex to implement, may require significant changes to core logic (for internal encoding).
- Not flexible in choice of encoding (label,one-hot,etc). If user wants different types, will have to resort to externally doing it anyway.
- Removing parameter comes at cost of requiring user to convert numerical to categorical themselves.
- If this is the case then user can just convert categorical to numerical and use the data. Providing internal encoding is kind of redundant.

---
---

## Personal preference -
- Design 1:
- I see no need for the encoding to be done internally. Preprocessing data is a regular part of the ml workflow and encoding categorical features is part of it. If users want to use cat features in estimators that do not support natively, then they can encode it themselves externally or in pipelines and then pass it to the estimator. This is how it is everywhere else, including sklearn.
- Additionally, even if we are providing internal encoding, it is not flexible and provides only a single option to the user. If user wants to use different encoding on diff columns, will have to resort to doing it externally anyway.