
categorical support design proposal #36

Open · wants to merge 4 commits into base: main
Conversation

Abhay-Lejith

@fkiraly @yarnabrina Here is the enhancement proposal for categorical support.

Member

@yarnabrina yarnabrina left a comment

Added a few comments. One additional one here: have you planned for pipeline cases?

For simplicity, let's consider sequential pipelines. One earlier step may use an estimator that supports categorical features natively while a later step may not, and vice versa. What will happen in those cases?

steps/24_categorical_support/step.md
| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, proceeds as usual without error. |
| NO | NO | YES| informative error message to user that categorical feature is not supported by the estimator |
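As an illustration of the second row ("data has categorical, estimator does not support"), a minimal sketch of the check; the helper name and the dtype heuristic are illustrative, not sktime API:

```python
import pandas as pd

def check_categorical(X, supports_categorical):
    """Raise an informative error when categorical columns reach an
    estimator without native support (sketch, not sktime API)."""
    cat_cols = [
        c for c in X.columns
        if isinstance(X[c].dtype, pd.CategoricalDtype) or X[c].dtype == object
    ]
    if cat_cols and not supports_categorical:
        raise TypeError(
            f"Columns {cat_cols} are categorical, but this estimator does not "
            "support categorical features natively. Encode or drop them first."
        )

X = pd.DataFrame({"promo": pd.Categorical(["a", "b", "a"]), "price": [1.0, 2.0, 3.0]})
check_categorical(X, supports_categorical=True)  # native support: no error raised
```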
Member

Thoughts: should it warn and then work by dropping specified column(s)?

Contributor

that's a different design, there are multiple options - encode, drop, or fail

Member

I mentioned it as "thoughts" to suggest to @Abhay-Lejith to consider this as another possibility (albeit not a very desirable one) too, to weigh pros and cons. The intention seems to have been miscommunicated.

steps/24_categorical_support/step.md
| YES | YES | NO | informative error message that the specified column was not present in the data. |
| YES | YES | YES | use categorical columns as detected by feature_kind along with the user-specified columns (union) and proceed. |

- in the case where the specified column is not found, we can also choose to ignore it and continue with a warning that the column was not found.
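The "union of detected and user-specified columns" logic, together with the raise-vs-warn choice above, could look roughly like this (a sketch; the function name and `missing` parameter are illustrative, not sktime API):

```python
import warnings
import pandas as pd

def resolve_categorical_columns(X, user_specified, missing="raise"):
    """Union of dtype-detected (feature_kind-style) and user-specified
    categorical columns; `missing` picks error vs. warn-and-ignore (sketch)."""
    detected = {c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)}
    specified = set(user_specified)
    absent = sorted(specified - set(X.columns))
    if absent:
        if missing == "raise":
            raise ValueError(f"Specified categorical columns {absent} not present in data.")
        warnings.warn(f"Ignoring specified columns not present in data: {absent}")
    return sorted(detected | (specified & set(X.columns)))

X = pd.DataFrame({"store": pd.Categorical(["a", "b"]), "dow": [1, 2]})
cols = resolve_categorical_columns(X, ["dow"])  # user declares numeric 'dow' categorical
```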
Member

Clarification: what do you mean by "continue" here?

For example, let's consider ARIMA for revenue forecasting and we have categorical feature of user rating. What will happen?

Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

@yarnabrina This is design 1 which allows categorical only in estimators which support natively. ARIMA does not support natively, so it would just fail saying categorical not supported.
Also, in the other comments, could you also mention which line number you are referring to?

Contributor

so, if we have a different approach, I would suggest you record it and try to structure the different choices one can make here

Author

@fkiraly , the document has three major headings that say Design 1, Design 2 and Design 3. Is it really not clear?

---

### Design 2. Perform one-hot encoding if not natively supported.
Member

Notes: definitely can be default choice but should be configurable per column and/or overall.

Author

Is this really required though? Encoders exist for this very purpose and are available to users. Why do we have to provide all that functionality inbuilt? In sklearn, there is nothing of this sort. Users can just preprocess the data and then use it in estimators.

Member

@yarnabrina yarnabrina Jun 15, 2024

In my opinion, it is really required.

  • a one-hot encoder is not always applicable or suitable.
    • variables with many levels may create too many features, which can cause issues.
    • ordinal variables will lose their ordering entirely.
  • if encoding happens outside, there is no possibility of using the estimator's native support when it is available in one pipeline step, and encoding otherwise, etc. I asked the pipeline question for this very reason.
  • since a one-hot encoder changes the dimension of the data, it may actually be unsupported in some cases, or complicated by the collinearity it usually adds (unless you drop one column).
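The collinearity point can be seen concretely with plain pandas (a sketch, not sktime code): the full set of indicator columns sums to 1 in every row, which is collinear with an intercept term, while dropping one level removes this.

```python
import pandas as pd

X = pd.DataFrame({"size": pd.Categorical(["S", "M", "L", "M"])})
full = pd.get_dummies(X)                      # one indicator per level: each row sums
                                              # to 1, collinear with an intercept
reduced = pd.get_dummies(X, drop_first=True)  # drop one level to remove the collinearity
```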

I don't think we should do a direct comparison with sklearn when planning everything about sktime. They are not the same package, and the capabilities they provide are mostly not at the same level (sklearn implements all their algorithms; in most places the sktime framework gives a unified format over different underlying packages). This is very likely opinionated, and @fkiraly and other core developers may not share the same view. But giving the user the choice should be preferred over fixing something, in my opinion.

Author

Yes of course. I agree that providing all options available is better than a single fixed encoding option.

I meant, is the internal encoding necessary? And why?

Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

> But giving user the choice should be preferred over fixing something in my opinion.

Precisely, and they have full choice on what columns to encode, what encoder to use and they can do it externally. I don't quite get why we want to make such major changes to sktime only to provide what can already be done.

Trying to cram such a large functionality into an inbuilt feature will prove very difficult. We would need to allow the user to specify which columns to encode, which encoders to use, which encoder to use on which column, etc.

On the other hand, if we fix a default encoding type, then in most cases the user might have to resort to external encoding anyway. So the internal encoding is not very useful.

Member

Indeed it's a complex change, there is no doubt about that at all. But that can't be the reason not to even consider it in the planning/design phase; the complexity of the task is why it was considered for a full-time summer project instead of open-source contributors adding parts of it based on their availability.

> if encoded outside, there will not be any possibility to use native support from the estimator when available in a pipeline step, and use encoding otherwise etc. I asked the pipeline question very much for this reason.

I think you did not consider this case, please correct me if I am wrong. I'll give a possible flow below; this is similar to @benHeid's advanced non-linear pipeline.

Suppose a forecasting pipeline has 3 steps: first it predicts a few of the future-unknown X variables using other future-known X variables, then their union is used to predict y. The forecasters used in the two steps need not be the same, and one may have native support for categorical features while the other may not. In that case, it's perfectly reasonable for a user to want to take advantage of native support where available and use encoding where not.

I am visualising that in this case we need to encode only for the forecaster that does not have native support, not for the other one. If there are more estimators in the pipeline where the same variable(s) need to be encoded, reuse the same trained encoder there too instead of training it again and again. This does not have to be the only way, and there may be better methods, but as of now I am not sure this can be handled if the encoding of categorical features happens outside the sktime pipeline, done by users directly.

| YES | YES | YES | use categorical features detected by feature_kind and those specified by the user (union) and proceed |

#### Estimators that will be affected:
- all estimators that take exog input
Member

Clarification: are we limiting the scope of categorical support to only forecasters and to only X?

Author

For now, I think we can start with just X. For y, I feel there will be complications. One major one I can think of immediately is univariate data becoming multivariate on one-hot encoding.
We can of course extend to y later, I suppose.
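The univariate-to-multivariate concern can be illustrated in two lines (a sketch with made-up data):

```python
import pandas as pd

y = pd.Series(pd.Categorical(["up", "down", "up", "flat"]), name="direction")
y_encoded = pd.get_dummies(y)  # the univariate series becomes a 3-column (multivariate) frame
```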

Contributor

@fkiraly fkiraly Jun 15, 2024

> One major one I can think of immediately is univariate becoming multivariate on one-hot encoding.

I think you are making leaps of argumentation here, @Abhay-Lejith - you keep presupposing the "one hot encoding" idea, and silently assuming that it is the only way to deal with categorical data.

What I mean is, when actively asked you are certainly aware that this is not the case, but your "intuitive thought processes" still seem to assume it, as the above shows for me.

Author

@fkiraly I just brought up a certain case where if the user were to one-hot encode y, it could cause issues. I never said it is the only case. Is anything wrong in this?


### Design 2. Perform one-hot encoding if not natively supported.

- `categorical_feature` param will be present in all models.
Member

Thoughts: if this is available as a configuration instead of as an argument, will that be feasible? If so, will it help?

Member

@Abhay-Lejith @fkiraly any thoughts about this comment?

@Abhay-Lejith
Author

> One additional one here: have you planned pipeline cases?

No, I have not gone into the pipeline part of things yet.

## Introduction

### Objectives
Allow users to pass data with categorical features to any estimator. If not supported, return an informative error message. Otherwise it should run without errors.
Contributor

This objective already assumes a design. "if not supported, ..."

Try to separate objective from solutioning, please.

Author

Yes, I didn't notice, will change it.

## Description of proposed designs


For each design we will have to discuss 8 combinations of cases:
Contributor

try to give more context to the reader: what is the context/setting? No. 2 in particular may be unclear without it.

3. categorical present in data: yes/no
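The 8 combinations come from three binary dimensions (the first two reconstructed from the table headers elsewhere in this thread); a two-line enumeration makes the count explicit:

```python
from itertools import product

# Three yes/no dimensions, as in the design tables in this proposal:
dimensions = (
    "estimator supports categorical natively",
    "cat feature specified by user",
    "cat feature present in data",
)
cases = list(product(("YES", "NO"), repeat=len(dimensions)))  # 2**3 = 8 combinations
```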

## Specifying categorical features:
- The reason I proposed this is in case the user wants to treat a column with numerical values as categorical. This will not be detected by feature_kind as it relies only on the dtype.
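The "numeric column the user wants treated as categorical" situation is easy to show with pandas; dtype-based detection only changes its verdict after an explicit conversion (a sketch with made-up data):

```python
import pandas as pd

X = pd.DataFrame({"store_id": [101, 102, 101], "sales": [5.0, 7.5, 6.0]})
numeric_before = X["store_id"].dtype.kind  # 'i': dtype-based detection sees it as numeric

# the user declares the column categorical; dtype-based detection now picks it up
X["store_id"] = X["store_id"].astype("category")
```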
Contributor

"i proposed this" - missing context. Try to separate solutions from requirements, and ensure the document contains context before you reference the context.

Contributor

@fkiraly fkiraly left a comment

Great start!

My top level comment is that you should start with the problem and requirements for a solution, before talking about solutions, or assuming solutions.

For instance, example requirements:

  • looping over estimators should be easy for the user and follow the strategy pattern as much as possible (exchangeability)
  • informative error messages are always raised in case of failures

Further comment, this is a general one that applies in many places: try to give context before you discuss opinions or solutions. For instance, what is the problem you are solving? Or, what are expected inputs or behaviours?

@Abhay-Lejith
Author

@fkiraly @yarnabrina , I've updated the proposal and worked on your feedback. Please take a look.

Allow users to pass exog data with categorical features to any forecaster. Outcome will depend on whether the forecaster natively supports categorical or not.
Two broadly possible outcomes:
- error in fit saying not natively supported.
- internally encode and fit successfully.
Contributor

we do not always need to encode.

Author

If not natively supported, the above two are the only options, right?



### Models which support categorical natively that have been identified so far:
- xgboost
Contributor

how does this compare to the list in
sktime/sktime#6109 (comment)
and subsequent discussion?

Author

This list is of forecasters with native support. And I am currently looking at only categorical in exog.

This information cannot be detected using the feature_kind metadata. It has to be passed by the user in some manner.

- Possible solutions:
  - `categorical_features` arg in model where user can specify such columns.
Contributor

more solutions:

  • the user converts the data outside sktime, pandas does have this feature, for instance
  • a pipeline element that does that

Author

This line L-86 is the same right?

  • not deal with this case and expect user to convert from numerical to categorical. We can make a transformer for this purpose.

- not deal with this case and expect user to convert from numerical to categorical. We can make a transformer for this purpose.

### Req 2
We will use a tag `categorical_support` (=True/False) or similar to identify whether a forecaster has native support.
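The tag lookup could work along these lines; the snippet uses a minimal stand-in for sktime's class-level tag mechanism, and the tag name `capability:categorical` is hypothetical (following the `capability:something` convention suggested below):

```python
# Minimal stand-in for sktime's class-level tag mechanism (sketch, not sktime code).
class BaseForecaster:
    _tags = {"capability:categorical": False}  # safe default: assume no native support

    def get_tag(self, name, default=None):
        return self._tags.get(name, default)

class NativeCategoricalForecaster(BaseForecaster):
    """Illustrative subclass standing in for, e.g., an xgboost-based reducer."""
    _tags = {"capability:categorical": True}
```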
Contributor

that would be a capability tag, so in line with most such tags it would be capability:something

- Set tag depending on model used in reduction.
- Since only a few models (3) have been identified with native support, a hard-coded method like an if/else statement to check the model could be used.
- Would be more difficult to extend if new estimators with native support are found.
- Maintain some sort of list internally in sktime of estimators with native support, and check against this.
Contributor

further option: try/except with a minuscule dataset, can be combined with a "known" list
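The try/except probe could be as small as this (a sketch; `LinearRegression` is used purely as an example of an estimator without categorical support, and in practice the probe would be combined with a "known" list so it only runs for unknown estimators):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def probe_categorical_support(estimator):
    """Fit on a minuscule dataset with one categorical column; failure implies
    no native categorical support (sketch)."""
    X = pd.DataFrame({"cat": pd.Categorical(["a", "b", "a"]), "num": [1.0, 2.0, 3.0]})
    y = pd.Series([0.1, 0.2, 0.3])
    try:
        estimator.fit(X, y)
        return True
    except (TypeError, ValueError):
        return False
```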

Contributor

sklearn also has tags, is there perhaps one for this?

1. Need to know whether categorical data is present in X_train and which columns.
2. Need to know if forecaster has native categorical support.
3. In reduction, we need to know if model used has native categorical support.
4. In reduction, we need to pass parameters (like `enable_categorical=True` in the case of xgboost) to the model used, if natively supported, according to the requirements of the model.
Contributor

No. 1 is not a requirement but a solution. Try to separate these clearly. At least, I do not see why this needs to be the case.

Author

It is a requirement. xgboost for instance 'requires' this parameter to be passed to it when using categorical.

Contributor

I meant "specification requirement" - what are the requirements towards the piece of software that you will implement. That's a specific use of the word: https://en.wikipedia.org/wiki/Software_requirements

y_pred = forecaster.predict(X_test, fh=[1, 2])
```

## Requirements
Contributor

start with high-level requirements, e.g., what does the API need to do? Then proceed to driving lower level ones. Try to avoid mixing up requirements and solutions.

Author

I do not understand. Can you give me an example of a requirement.
I don't think any of the below points are solutions. I have listed the requirements below and then discussed solutions to each separately in the next section.

* Non-numeric dtypes are completely blocked off in datatype checks.
* https://github.com/sktime/sktime/pull/5886 - This PR removes these checks but has not been merged yet, as most estimators do not support categorical and will break without informative error messages.
* https://github.com/sktime/sktime/pull/6490 - PR adding feature_kind metadata

Contributor

I would suggest, start with outlining a conceptual model. Explain what kind of (abstract) estimators there are, and what you want to model. Then proceed to example code that should run, try to cover different cases.

Author

Wouldn't this 'conceptual model' heavily depend on which design we choose to go ahead with? Or do you want me to do this for all the proposed designs?

> Explain what kind of (abstract) estimators there are, and what you want to model

I don't think there are any new estimators being made, changes will have to be made to existing classes and again, the change will depend on what design we go for.

Comment on lines 187 to 205
### Design 3. Same as design 2 but NO `cat_features` parameter in model
Instead, user is required to convert the required numerical features to categorical. To ease the process, we can make a `transformer` to make this conversion.

- no `cat_features` parameter in models.

| **Estimator supports categorical natively** | **Cat feature present in data** | **Outcome** |
|:-------------------------------------------:|:-------------------------------:|:---------------:|
| NO | NO | normal case, fits without error. |
| NO | YES | encode features internally as detected by feature_kind and continue to fit. |
| YES | NO | normal case, fits without error. |
| YES | YES | use categorical features detected by feature_kind and fit. |

#### PROS:
- All forecasters will be able to accept categorical inputs.

#### CONS:
- Not flexible in the choice of encoding (label, one-hot, etc.) if we fix a default. If the user wants different types, they will have to resort to doing it externally anyway.
- Removing the parameter comes at the cost of requiring the user to convert numerical to categorical themselves.
- If this is the case, then the user can just convert categorical to numerical and use the data. Providing internal encoding is kind of redundant.
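The helper transformer proposed for design 3 ("convert the required numerical features to categorical") could be as small as this sketch; the class name and interface are illustrative, not sktime API:

```python
import pandas as pd

class ToCategorical:
    """Mark chosen numeric columns as categorical so dtype-based (feature_kind)
    detection finds them (sketch, not an sktime transformer)."""

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].astype("category")
        return X

X = pd.DataFrame({"dow": [0, 1, 2], "sales": [5.0, 6.0, 7.0]})
Xt = ToCategorical(columns=["dow"]).fit_transform(X)
```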
Member

I'm quite confused about design 3. L188 and L204 mention that the user is required to own the encoding, but L195 and L203 suggest we are encoding with a fixed default encoding method. Can you please help me understand what exactly the plan is?

Author

@yarnabrina the encoding the user needs to do here is from 'numerical' to 'categorical', for the cases where they want to treat a numerical column as a categorical feature (because we cannot detect these by feature_kind). Internal encoding is from categorical to numerical and is done by a fixed default. Is it clear now?

- This is basically trying to provide the entire encoding ecosystem internally.
- Will be very difficult to implement. Where will the user specify all this data?

- I don't think this is a good idea.
Member

Thanks for sharing your opinion, but just to understand better, can you please tell whether you think it's a bad idea or a complex idea? These are not necessarily the same.

If it's the first one, viz. a bad idea, can I request you to explain why? I've shared my reasoning with respect to the multi-step orchestration process above; I would like to know yours.

Contributor

For me, I do not understand what concretely design 4 is, so I cannot form an opinion due to lack of specificity and information.

What would help me a lot is if, in all cases, the scenario "data has categorical, estimator does not support" is concretely spelled out. What does the speculative code look like, and what happens internally?

Author

@fkiraly I have mentioned (L-210) that it is the same as Design 2 or 3, but we are extending it by trying to provide more encoding options to the user.
So in the scenario "data has categorical, estimator does not support", it will encode the categorical features and fit.
I haven't gone into more detail of how it will look yet as I was not sure about how we would allow the user to specify all this data.

Author

@Abhay-Lejith Abhay-Lejith Jun 22, 2024

@yarnabrina , sorry about bluntly saying it's a bad idea without any explanation.
But my thought was that doing so much only to provide an existing functionality under the hood seemed unnecessary to me. Also, I did not think of the set_config idea you proposed, so I was also wondering how we would let the user specify so much data.
Edit: I was thinking ColumnTransformer, but I see that sktime has its own version as ColumnEnsembleTransformer (as noted by @fkiraly as well).


### Design 2. Perform one-hot encoding if not natively supported.

- `categorical_feature` param will be present in all models.
Member

@Abhay-Lejith @fkiraly any thoughts about this comment?

Member

Based on my understanding, this is the case:

  • design 1 requires users to specify which features are categorical.
  • design 3 requires users to encode externally.

So, a general question: is design 2 the only design to make use of feature kind PR?

Author

I have clarified about design 3 in #36 (comment). Please check it out.

Comment on lines 10 to 11
- error in fit saying not natively supported.
- internally encode and fit successfully.
Member

Besides L10 and L11, is the following not considered a valid option?

"Fit/predict by dropping/ignoring non-numerical features"

It's probably more useful than complete failure, and can possibly be combined with other designs, especially design 1, to increase the scope to all forecasters.

I did comment this earlier as well, don't know what you think about this @Abhay-Lejith @fkiraly .
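The "warn and drop" option being discussed could look roughly like this (a sketch; the helper name is illustrative, not sktime API):

```python
import warnings
import pandas as pd

def drop_categorical_with_warning(X):
    """Remove categorical columns with a warning instead of failing outright
    (sketch of the 'warn and drop' option)."""
    cat_cols = [c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)]
    if cat_cols:
        warnings.warn(f"Estimator lacks categorical support; dropping columns {cat_cols}.")
    return X.drop(columns=cat_cols)

X = pd.DataFrame({"region": pd.Categorical(["n", "s"]), "temp": [20.0, 25.0]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    X_numeric = drop_categorical_with_warning(X)
```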

Author

Yes it is a possible solution. I am updating the design document once again and will add this.

---

### Design 1. Allow categorical features ONLY in those forecasters which support them natively (no encoding internally)
Member

In design 1, is the idea to create XGBoostForecaster etc. similar to LightGBM in Darts, basically as a wrapper on top of make_reduction?

  • If yes
    • how do you propose to support for these specific regressors through reduction forecasters from skforecast and darts?
  • If no
    • can you please give more details?

Author

There are essentially two options: create this wrapper on top of make_reduction itself or modify make_reduction to work with categorical. I asked for your inputs on the wrapper idea in the umbrella issue and both of you were against it. So for now the design here is on modifying make_reduction to support it based on finalised solutions to req3 and req4.
I am also thinking about the wrapper class idea in a bit more detail and will try to add it here as well.

- Which columns to encode?
- Which encoder to use on which column?
- This is basically trying to provide the entire encoding ecosystem internally.
- Will be very difficult to implement. Where will the user specify all this data?
Member

Users can pass configuration for the forecaster to specify encoding choices, with a specific encoding method (say, one-hot encoding) being the default. It'll look like forecaster.set_config(encoding_configuration=...).

The values can be of 2 types:

  1. single encoder instance -> use feature kind to identify categorical features
  2. dictionary with column names as keys and value as encoder instance

In my point of view, the pros are that this will not need to modify signature of any forecasters, and users have full freedom for encoding choices if they so want. This can also be set on a pipeline instance and can be forwarded to all estimators in the pipeline possibly to ensure all of them use same encoding method (preferably same encoder instance).

Requesting @Abhay-Lejith and @fkiraly to suggest cons.
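The two value types described above could be resolved into a per-column mapping roughly as follows (a sketch; `encoding_configuration` is the hypothetical name proposed in this comment, and the resolver is illustrative, not sktime API):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def resolve_encoding_config(X, config):
    """Resolve the proposed set_config value: a single encoder instance applies
    to all dtype-detected categorical columns; a dict gives per-column choices."""
    if isinstance(config, dict):
        return dict(config)
    cat_cols = [c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)]
    return {c: config for c in cat_cols}  # same instance shared across columns

X = pd.DataFrame({"size": pd.Categorical(["S", "M"]), "price": [1.0, 2.0]})
by_detection = resolve_encoding_config(X, OneHotEncoder())           # type 1
per_column = resolve_encoding_config(X, {"size": OrdinalEncoder()})  # type 2
```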

Contributor

@fkiraly fkiraly Jun 22, 2024

My problem with this is that if we use set_config, the modelling choices here will not be tunable, or visible via set_params, which is the "big advantage" of the sklearn-like interface.

I would feel strongly that these choices - as they are modelling choices - should sit somewhere where get_params, set_params see them.

The most natural place would be existing pipelines, and I am noting that the above seems to replicate the functionality of ColumnEnsembleTransformer.

Member

A few clarification questions:

  1. I am definitely unfamiliar with ColumnEnsembleTransformer in sktime; how is it different from ColumnTransformer (from sktime or from sklearn)?

  2. Can you please elaborate how this transformer can be used for the orchestration requirement I mentioned earlier in Discord (use encoding with estimator-specific scope, so that we only do encoding where required and use native support wherever available)? Of course, "we are not giving that option to the user" is also an option; is that the plan?

  3. If you are suggesting that the encoding method (or methods, if we allow variation per column) needs to be tunable, which does make sense if we go by the pipeline approach, I think designs 2-3 are not suitable because they rely on a fixed encoder, and 1 is kind-of passable as it does not do encoding at all and relies on whatever encoding estimators do natively (@Abhay-Lejith please confirm).

Author

@Abhay-Lejith Abhay-Lejith Jun 23, 2024

About point 3, yes you are right.
I did not have in mind this 'tunable' requirement, until @fkiraly brought it up as a con to the set_config idea. So the fixed default encoder designs will not be tunable since it is a 'fixed' default.

About point 2, this case of a pipeline where one estimator may have native support and another may not, and we wish to use the encoded features in one but the native support in the other, seems like a rather niche case to me. Why not just use the encoded features for all estimators in the pipeline? Also, just so I am clear on it, could you give an example of such a pipeline with code?

Contributor

@fkiraly fkiraly left a comment

Short thought: at different stages, we are talking about one-hot encoding or dropping.

I think this can be abstracted as a discussion of default or option for transformer in the "no categorical support" / "categorical features present" case.

There are two transformers explicitly mentioned:

  • "drop categorical columns"
  • "encode cateogrical columns with one-hot-encoder"

both transformers share the property that they remove categorical columns and leave only numerical columns.
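As a concrete illustration of that shared property, here is a minimal pandas-only sketch of the two options (column names are hypothetical, not from the proposal):

```python
import pandas as pd

# exogenous data with one categorical column (names are hypothetical)
X = pd.DataFrame({
    "temp": [20.1, 21.3, 19.8],
    "weather": ["sun", "rain", "rain"],
})
cat_cols = X.select_dtypes(include=["object", "category"]).columns

# option 1: drop categorical columns
X_dropped = X.drop(columns=cat_cols)

# option 2: one-hot encode categorical columns
X_encoded = pd.get_dummies(X, columns=list(cat_cols))

# either way, the result contains only non-categorical columns
```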

@Abhay-Lejith
Author

@fkiraly @yarnabrina I have restructured the proposal and solution designs based on the "data has categorical, model doesn't support" case as discussed on Discord, taking into account the other points and suggestions.

## About reduction case:

### Models which support categorical natively that have been identified so far:
- xgboost
Contributor

@fkiraly fkiraly Jun 27, 2024

these are tabular regression models, not models in the scope of sktime. Please discuss models within the scope of sktime, even if abstract or speculative.

Author

Yes, I have mentioned just below that we will have to use reduction for these.
I did not find any sktime forecasters that directly take categorical variables in exog.


## Solutions for some of the requirements:

### Requirement 1
Contributor

There are no requirements specified above, so it is a bit vague what this is solving.

Author

@fkiraly I have listed requirements above in L-41 to L-52.

Contributor

@fkiraly fkiraly left a comment

Thanks!

This is going in the right direction. I think, though, we need to really separate out the different decisions in here; currently, different decisions are mixed inside the different designs.

I can spot the following decisions:

  • how the information about which columns are categorical reaches the forecaster boilerplate, and the forecaster internals.
    • there might be only a single option considered (via datatypes check)
  • how the information about which numerical columns should be considered categorical is entered by the user
    • options I see in the document: __init__ arg of the forecaster; argument of a component; set_config; not explicitly specified but done by pipelining with a transformer
    • sub-question: do we require all forecasters to have a cat_features argument or not. If yes, what does it do?
  • if the user wants to specify an encoder, how is this done? This is a separate decision, and could be solved in multiple ways, e.g., __init__ param, set_config, done via pipeline, manual encoding, etc. Some designs may preclude one or multiple solutions here.

@fkiraly
Contributor

fkiraly commented Jun 28, 2024

some unordered notes from today's meetup, by @yarnabrina and myself:

AR, FK
discussing categorical feature support designs

use case: pipeline with encoder

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

pipe = Encoder(...) ** Forecaster(...)
...  # -> only needs to allow mtypes to support non-num data; everything else is the user's responsibility to handle categorical vars
```

FK: morally agree

There is the issue of broadcasting, if encoding is not stable wrt order of occurrence of categories, for instance. Encoder might be broadcast across instances with unstable encoding.

Example:

panel with two series
series 1: "A" "B" "C" "A" "B" "C"
series 2: "B" "C" "A" "B" "C" "A"

if encoder maps categories in order of occurrence on 1, 2, 3, then applying by series will result in encoding 123123 for both, but it should be 123123 and 231231

Problem happens if we broadcast! E.g., encoder is "local", Forecaster is "global/hierarchical"
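The instability can be demonstrated with pandas' factorize, which (like some encoders) assigns codes in order of first occurrence. This is a hypothetical illustration, not sktime code:

```python
import pandas as pd

s1 = pd.Series(["A", "B", "C", "A", "B", "C"])
s2 = pd.Series(["B", "C", "A", "B", "C", "A"])

# broadcast / "local" encoding: each series encoded on its own,
# codes assigned in order of first occurrence
codes1_local, _ = pd.factorize(s1)
codes2_local, _ = pd.factorize(s2)
# both come out as [0, 1, 2, 0, 1, 2] -- the category-to-code mapping differs!

# "global" encoding: one shared mapping, fitted once on series 1
mapping = {cat: code for code, cat in enumerate(pd.factorize(s1)[1])}
codes2_global = s2.map(mapping)
# series 2 now consistently encodes as [1, 2, 0, 1, 2, 0]
```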

use case: custom encoder specification

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

pipe = Forecaster(...)

pipe.set_config(cat_features={"var1": OneHotEncoder(), "var2": FeatureBinner()})
```

FK comments:

  1. this is not tunable; parameter choices are not exposed to the user. Feels strongly that modelling choices should be exposed via the dedicated interface point, set_params / get_params
  2. config fields also do not support estimators (afaik); this can lead to problems, e.g., estimators are not cloned if passed to config.
  3. functionally, this is the same as ColumnEnsembleTransformer, or a pipeline with FeatureUnion

With ColumnEnsembleTransformer, it would look like this:

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

col_tr = ColumnEnsembleTransformer(
    [
        ("onehot", OneHotEncoder(), "var1"),
        ("bin", FeatureBinner(), "var2"),
    ]
)
pipe = col_tr ** Forecaster(...)
```

FK: this looks like the same amount of complexity in specification, and the advantage is that it does not introduce separate, special syntax to cover the case.

AR: looks ok for design 3
would like to have orchestration feature in-between

```python
pipe = est1 * est2 * est3
# est1, est3 -> do not support categorical variables
# est2 does support categorical variables
# I wish to use native categorical support for est2 and use encoding for est1 and est3
# building the pipeline with an encoder first will not allow that

# pipe = (enc * est1 * enc_rev) * est2 * (enc * est3 * enc_rev)

# y, X = ...
# X has a few cat features, unknown for the future. It also has a few cat features, known for the future.
# predict unknown cat features first (similar to ForecastX)
```

AR summary

  1. user specifies categorical features (optionally).
  2. sktime auto-determines categorical features using feature kind.
  3. in case of direct use of a forecaster, convert the forecaster to ColumnEnsembleTransformer() ** forecaster -> uses a specific encoder by default (e.g. one-hot)
  4. else the user creates that pipeline themselves

FK: this looks like design 2b

  • default encoding via OneHotEncoder in yes/no case
  • user uses ColumnEnsembleTransformer in specification case

for reference, we should probably consider clear names:

  • "yes/no case" - estimator does not support categorical data, but is passed
  • "encoder override" - user wants to select a specific different encoder for a packaged estimator that has a hard-coded internal encoder
  • "custom specification" - user wants to specify different encoders for different variables/columns

In design 2b, these are covered as follows:

  • "yes/no case" - OneHotEncoder is automatically applied, internally
  • "encoder override" - user pipelines MyEncoder() ** Forecaster()
  • "custom specification" - user pipelines ColumnEnsembleTransformer(stuff) ** Forecaster()

AR question: what are the changes from current to 2b above?
I.e., the difference between current main and the above target (2b), in terms of "todos", PRs, etc.

FK thinks:

  • enabling categorical in mtypes
  • feature_kind mechanism
  • in specific models that support categorical, use feature_kind to pass to the internal cat_variables; do not expose at __init__
  • ensure that categorical encoders work in TabularToSeriesAdaptor, or similar, and that they lead to consistent encoding in case of vectorization. I.e., ensure that OneHotEncoder() ** forecaster has all desired properties for an existing forecaster (no categorical variables supported)
    • sub-task: ensure that sklearn encoders play nicely with compositors like ColumnEnsembleTransformer

AR query: if user wants to treat a specific numeric variable (say ratings 1-5 scale) as categorical, how to specify?

Q: as binary multioutput?
In design 2b, it would be

```python
ColumnEnsembleTransformer([("onehot", OneHotEncoder(), "varname_to_encode")]) ** forecaster
```

or simply OneHotEncoder() ** forecaster if it is the only exogenous variable

Or, as a single categorical (multiclass, single output), it could be something like

```python
Categorizer("varname_to_encode") ** forecaster
```
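If categorical columns end up being auto-detected by dtype (the feature_kind mechanism), another route, sketched here with plain pandas and hypothetical column names, would be to cast the numeric ratings column to a categorical dtype before passing it:

```python
import pandas as pd

# hypothetical exogenous data with a 1-5 ratings column stored as integers
X = pd.DataFrame({
    "rating": [4, 5, 3, 4, 1],
    "price": [9.5, 12.0, 8.0, 10.0, 7.5],
})

# cast to pandas categorical dtype, so dtype-based detection
# would classify the column as categorical
X["rating"] = X["rating"].astype("category")
```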

Status: PR in progress
3 participants