
categorical support design proposal #36

Open · wants to merge 4 commits into base: main
Conversation

Abhay-Lejith

@fkiraly @yarnabrina Here is the enhancement proposal for categorical support.

Member

@yarnabrina yarnabrina left a comment

Added a few comments. One additional one here: have you planned for pipeline cases?

For simplicity, let's consider sequential pipelines. One earlier step may use an estimator that supports categorical features natively while a later step may not, and vice versa. What will happen in those cases?

steps/24_categorical_support/step.md
| **Estimator supports categorical natively** | **Cat feature specified by user** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, proceeds as usual without error. |
| NO | NO | YES| informative error message to user that categorical feature is not supported by the estimator |
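As an illustration of the second row ("data has categorical, estimator does not support"), a minimal sketch of the check; the helper name and the dtype heuristic are illustrative, not sktime API:

```python
import pandas as pd

def check_categorical(X, supports_categorical):
    """Raise an informative error when categorical columns reach an
    estimator without native support (sketch, not sktime API)."""
    cat_cols = [
        c for c in X.columns
        if isinstance(X[c].dtype, pd.CategoricalDtype) or X[c].dtype == object
    ]
    if cat_cols and not supports_categorical:
        raise TypeError(
            f"Columns {cat_cols} are categorical, but this estimator does not "
            "support categorical features natively. Encode or drop them first."
        )

X = pd.DataFrame({"promo": pd.Categorical(["a", "b", "a"]), "price": [1.0, 2.0, 3.0]})
check_categorical(X, supports_categorical=True)  # native support: no error raised
```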
Member

Thoughts: should it warn and then work by dropping specified column(s)?

Contributor

that's a different design, there are multiple options - encode, drop, or fail

Member

I mentioned it as "thoughts" to suggest to @Abhay-Lejith to consider this as another possibility (albeit not a very desirable one) too, to weigh pros and cons. The intention seems to have been miscommunicated.

steps/24_categorical_support/step.md
| YES | YES | NO | informative error message that the specified column was not present in the data. |
| YES | YES | YES | use categorical columns as detected by feature_kind along with the user-specified columns (union) and proceed. |

- in the case where the specified column is not found, we can also choose to ignore it and continue with a warning that the column was not found.
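The "union of detected and user-specified columns" logic, together with the raise-vs-warn choice above, could look roughly like this (a sketch; the function name and `missing` parameter are illustrative, not sktime API):

```python
import warnings
import pandas as pd

def resolve_categorical_columns(X, user_specified, missing="raise"):
    """Union of dtype-detected (feature_kind-style) and user-specified
    categorical columns; `missing` picks error vs. warn-and-ignore (sketch)."""
    detected = {c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)}
    specified = set(user_specified)
    absent = sorted(specified - set(X.columns))
    if absent:
        if missing == "raise":
            raise ValueError(f"Specified categorical columns {absent} not present in data.")
        warnings.warn(f"Ignoring specified columns not present in data: {absent}")
    return sorted(detected | (specified & set(X.columns)))

X = pd.DataFrame({"store": pd.Categorical(["a", "b"]), "dow": [1, 2]})
cols = resolve_categorical_columns(X, ["dow"])  # user declares numeric 'dow' categorical
```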
Member

Clarification: what do you mean by "continue" here?

For example, let's consider ARIMA for revenue forecasting and we have categorical feature of user rating. What will happen?

Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

@yarnabrina This is design 1 which allows categorical only in estimators which support natively. ARIMA does not support natively, so it would just fail saying categorical not supported.
Also, in the other comments, could you also mention which line number you are referring to?

Contributor

so, if we have a different approach, I would suggest you record it and try to structure the different choices one can make here

Author

@fkiraly , the document has three major headings that say Design 1, Design 2 and Design 3. Is it really not clear?

---

### Design 2. Perform one-hot encoding if not natively supported.
Member

Notes: definitely can be default choice but should be configurable per column and/or overall.

Author

Is this really required though? Encoders exist for this very purpose and are available to users. Why do we have to provide all that functionality inbuilt? In sklearn, there is nothing of this sort. Users can just preprocess the data and then use it in estimators.

Member

@yarnabrina yarnabrina Jun 15, 2024

In my opinion, it is really required.

  • a one-hot encoder is not always applicable or suitable.
    • variables with many levels may create too many features, which can cause issues.
    • ordinal variables will lose their ordering entirely.
  • if encoding happens outside, there is no possibility of using the estimator's native support when it is available in one pipeline step, and encoding otherwise, etc. I asked the pipeline question for this very reason.
  • since a one-hot encoder changes the dimension of the data, it may actually be unsupported in some cases, or complicated by the collinearity it usually adds (unless you drop one column).
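The collinearity point can be seen concretely with plain pandas (a sketch, not sktime code): the full set of indicator columns sums to 1 in every row, which is collinear with an intercept term, while dropping one level removes this.

```python
import pandas as pd

X = pd.DataFrame({"size": pd.Categorical(["S", "M", "L", "M"])})
full = pd.get_dummies(X)                      # one indicator per level: each row sums
                                              # to 1, collinear with an intercept
reduced = pd.get_dummies(X, drop_first=True)  # drop one level to remove the collinearity
```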

I don't think we should do a direct comparison with sklearn when planning everything about sktime. They are not the same package, and the capabilities they provide are mostly not at the same level (sklearn implements all their algorithms; in most places the sktime framework gives a unified format over different underlying packages). This is very likely opinionated, and @fkiraly and other core developers may not share the same view. But giving the user the choice should be preferred over fixing something, in my opinion.

Author

Yes of course. I agree that providing all options available is better than a single fixed encoding option.

I meant, is the internal encoding necessary? And why?

Author

@Abhay-Lejith Abhay-Lejith Jun 15, 2024

> But giving user the choice should be preferred over fixing something in my opinion.

Precisely, and they have full choice on what columns to encode, what encoder to use and they can do it externally. I don't quite get why we want to make such major changes to sktime only to provide what can already be done.

Trying to cram such a large functionality into an inbuilt feature will prove very difficult. We would need to allow the user to specify which columns to encode, which encoders to use, which encoder to use on which column, etc.

On the other hand, if we fix a default encoding type, then in most cases the user might have to resort to external encoding anyway. So the internal encoding is not very useful.

Member

Indeed it's a complex change, there is no doubt about that at all. But that can't be the reason not to even consider it in the planning/design phase; the complexity of the task is why it was considered for a full-time summer project instead of open-source contributors adding parts of it based on their availability.

> if encoded outside, there will not be any possibility to use native support from the estimator when available in a pipeline step, and use encoding otherwise etc. I asked the pipeline question very much for this reason.

I think you did not consider this case, please correct me if I am wrong. I'll give a possible flow below; this is similar to @benHeid's advanced non-linear pipeline.

Suppose a forecasting pipeline has 3 steps: first it predicts a few of the future-unknown X variables using other future-known X variables, then their union is used to predict y. The forecasters used in the two steps need not be the same, and one may have native support for categorical features while the other may not. In that case, it's perfectly reasonable for a user to want to take advantage of native support where available and use encoding where not.

I am visualising that in this case we need to encode only for the forecaster that does not have native support, not for the other one. If there are more estimators in the pipeline where the same variable(s) need to be encoded, reuse the same trained encoder there too instead of training it again and again. This does not have to be the only way, and there may be better methods, but as of now I am not sure this can be handled if the encoding of categorical features happens outside the sktime pipeline, done by users directly.

| YES | YES | YES | use categorical features detected by feature_kind and those specified by the user (union) and proceed |

#### Estimators that will be affected:
- all estimators that take exog input
Member

Clarification: are we limiting the scope of categorical support to only forecasters and to only X?

Author

For now, I think we can start with just X. For y, I feel there will be complications. One major one I can think of immediately is univariate data becoming multivariate on one-hot encoding.
We can of course extend to y later, I suppose.
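The univariate-to-multivariate concern can be illustrated in two lines (a sketch with made-up data):

```python
import pandas as pd

y = pd.Series(pd.Categorical(["up", "down", "up", "flat"]), name="direction")
y_encoded = pd.get_dummies(y)  # the univariate series becomes a 3-column (multivariate) frame
```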

Contributor

@fkiraly fkiraly Jun 15, 2024

> One major one I can think of immediately is univariate becoming multivariate on one-hot encoding.

I think you are making leaps of argumentation here, @Abhay-Lejith - you keep presupposing the "one hot encoding" idea, and silently assuming that it is the only way to deal with categorical data.

What I mean is, when actively asked you are certainly aware that this is not the case, but your "intuitive thought processes" still seem to assume it, as the above shows for me.

Author

@fkiraly I just brought up a certain case where if the user were to one-hot encode y, it could cause issues. I never said it is the only case. Is anything wrong in this?


### Design 2. Perform one-hot encoding if not natively supported.

- `categorical_feature` param will be present in all models.
Member

Thoughts: if this is available as a configuration instead of as an argument, will that be feasible? If so, will it help?

Member

@Abhay-Lejith @fkiraly any thoughts about this comment?

@Abhay-Lejith
Author

> One additional one here: have you planned pipeline cases?

No, I have not gone into the pipeline part of things yet.

## Introduction

### Objectives
Allow users to pass data with categorical features to any estimator. If not supported, return an informative error message. Otherwise it should run without errors.
Contributor

This objective already assumes a design. "if not supported, ..."

Try to separate objective from solutioning, please.

Author

Yes, I didn't notice, will change it.

## Description of proposed designs


For each design we will have to discuss 8 combinations of cases:
Contributor

try to give more context to the reader: what is the context/setting? No. 2 in particular may be unclear without it.

3. categorical present in data: yes/no
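The 8 combinations come from three binary dimensions (the first two reconstructed from the table headers elsewhere in this thread); a two-line enumeration makes the count explicit:

```python
from itertools import product

# Three yes/no dimensions, as in the design tables in this proposal:
dimensions = (
    "estimator supports categorical natively",
    "cat feature specified by user",
    "cat feature present in data",
)
cases = list(product(("YES", "NO"), repeat=len(dimensions)))  # 2**3 = 8 combinations
```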

## Specifying categorical features:
- The reason I proposed this is in case the user wants to treat a column with numerical values as categorical. This will not be detected by feature_kind as it relies only on the dtype.
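The "numeric column the user wants treated as categorical" situation is easy to show with pandas; dtype-based detection only changes its verdict after an explicit conversion (a sketch with made-up data):

```python
import pandas as pd

X = pd.DataFrame({"store_id": [101, 102, 101], "sales": [5.0, 7.5, 6.0]})
numeric_before = X["store_id"].dtype.kind  # 'i': dtype-based detection sees it as numeric

# the user declares the column categorical; dtype-based detection now picks it up
X["store_id"] = X["store_id"].astype("category")
```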
Contributor

"i proposed this" - missing context. Try to separate solutions from requirements, and ensure the document contains context before you reference the context.

Contributor

@fkiraly fkiraly left a comment

Great start!

My top level comment is that you should start with the problem and requirements for a solution, before talking about solutions, or assuming solutions.

For instance, example requirements:

  • looping over estimators should be easy for the user and follow the strategy pattern as much as possible (exchangeability)
  • informative error messages are always raised in case of failures

Further comment, this is a general one that applies in many places: try to give context before you discuss opinions or solutions. For instance, what is the problem you are solving? Or, what are expected inputs or behaviours?

@Abhay-Lejith
Author

@fkiraly @yarnabrina , I've updated the proposal and worked on your feedback. Please take a look.

Allow users to pass exog data with categorical features to any forecaster. Outcome will depend on whether the forecaster natively supports categorical or not.
Two broadly possible outcomes:
- error in fit saying not natively supported.
- internally encode and fit successfully.
Contributor

we do not always need to encode.

Author

If not natively supported, the above two are the only options, right?



### Models which support categorical natively that have been identified so far:
- xgboost
Contributor

how does this compare to the list in
sktime/sktime#6109 (comment)
and subsequent discussion?

Author

This list is of forecasters with native support. And I am currently looking at only categorical in exog.

This information cannot be detected using the feature_kind metadata. It has to be passed by the user in some manner.

- Possible solutions:
  - `categorical_features` arg in model where user can specify such columns.
Contributor

more solutions:

  • the user converts the data outside sktime, pandas does have this feature, for instance
  • a pipeline element that does that

Author

This line L-86 is the same right?

  • not deal with this case and expect user to convert from numerical to categorical. We can make a transformer for this purpose.

- not deal with this case and expect user to convert from numerical to categorical. We can make a transformer for this purpose.

### Req 2
We will use a tag `categorical_support` (=True/False) or similar to identify whether a forecaster has native support.
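The tag lookup could work along these lines; the snippet uses a minimal stand-in for sktime's class-level tag mechanism, and the tag name `capability:categorical` is hypothetical (following the `capability:something` convention suggested below):

```python
# Minimal stand-in for sktime's class-level tag mechanism (sketch, not sktime code).
class BaseForecaster:
    _tags = {"capability:categorical": False}  # safe default: assume no native support

    def get_tag(self, name, default=None):
        return self._tags.get(name, default)

class NativeCategoricalForecaster(BaseForecaster):
    """Illustrative subclass standing in for, e.g., an xgboost-based reducer."""
    _tags = {"capability:categorical": True}
```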
Contributor

that would be a capability tag, so in line with most such tags it would be capability:something

- Set tag depending on model used in reduction.
- Since only a few models (3) have been identified with native support, a hard-coded method like an if/else statement to check the model could be used.
- Would be more difficult to extend if new estimators with native support are found.
- Maintain some sort of list internally in sktime of estimators with native support, and check against this.
Contributor

further option: try/except with a minuscule dataset, can be combined with a "known" list
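The try/except probe could be as small as this (a sketch; `LinearRegression` is used purely as an example of an estimator without categorical support, and in practice the probe would be combined with a "known" list so it only runs for unknown estimators):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def probe_categorical_support(estimator):
    """Fit on a minuscule dataset with one categorical column; failure implies
    no native categorical support (sketch)."""
    X = pd.DataFrame({"cat": pd.Categorical(["a", "b", "a"]), "num": [1.0, 2.0, 3.0]})
    y = pd.Series([0.1, 0.2, 0.3])
    try:
        estimator.fit(X, y)
        return True
    except (TypeError, ValueError):
        return False
```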

Contributor

sklearn also has tags, is there perhaps one for this?

1. Need to know whether categorical data is present in X_train and which columns.
2. Need to know if forecaster has native categorical support.
3. In reduction, we need to know if model used has native categorical support.
4. In reduction, we need to pass parameters (like `enable_categorical=True` in the case of xgboost) to the model used, if natively supported, according to the requirements of the model.
Contributor

No. 1 is not a requirement but a solution. Try to separate these clearly. At least, I do not see why this needs to be the case.

Author

It is a requirement. xgboost for instance 'requires' this parameter to be passed to it when using categorical.

Contributor

I meant "specification requirement" - what are the requirements towards the piece of software that you will implement. That's a specific use of the word: https://en.wikipedia.org/wiki/Software_requirements

y_pred = forecaster.predict(X_test, fh=[1, 2])
```

## Requirements
Contributor

start with high-level requirements, e.g., what does the API need to do? Then proceed to driving lower level ones. Try to avoid mixing up requirements and solutions.

Author

I do not understand. Can you give me an example of a requirement.
I don't think any of the below points are solutions. I have listed the requirements below and then discussed solutions to each separately in the next section.

* Non-numeric dtypes are completely blocked off in datatype checks.
* https://github.com/sktime/sktime/pull/5886 - This PR removes these checks but has not been merged yet, as most estimators do not support categorical and will break without informative error messages.
* https://github.com/sktime/sktime/pull/6490 - PR adding feature_kind metadata

Contributor

I would suggest, start with outlining a conceptual model. Explain what kind of (abstract) estimators there are, and what you want to model. Then proceed to example code that should run, try to cover different cases.

Author

Wouldn't this 'conceptual model' heavily depend on which design we choose to go ahead with? Or do you want me to do this for all the proposed designs?

> Explain what kind of (abstract) estimators there are, and what you want to model

I don't think there are any new estimators being made, changes will have to be made to existing classes and again, the change will depend on what design we go for.

Comment on lines 187 to 205
### Design 3. Same as design 2 but NO `cat_features` parameter in model
Instead, user is required to convert the required numerical features to categorical. To ease the process, we can make a `transformer` to make this conversion.

- no `cat_features` parameter in models.

| **Estimator supports categorical natively** | **Cat feature present in data** | **Outcome** |
|:-------------------------------------------:|:-------------------------------:|:---------------:|
| NO | NO | normal case, fits without error. |
| NO | YES | encode features internally as detected by feature_kind and continue to fit. |
| YES | NO | normal case, fits without error. |
| YES | YES | use categorical features detected by feature_kind and fit. |

#### PROS:
- All forecasters will be able to accept categorical inputs.

#### CONS:
- Not flexible in the choice of encoding (label, one-hot, etc.) if we fix a default. If the user wants different types, they will have to resort to doing it externally anyway.
- Removing the parameter comes at the cost of requiring the user to convert numerical to categorical themselves.
- If this is the case, then the user can just convert categorical to numerical and use the data. Providing internal encoding is kind of redundant.
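The helper transformer proposed for design 3 ("convert the required numerical features to categorical") could be as small as this sketch; the class name and interface are illustrative, not sktime API:

```python
import pandas as pd

class ToCategorical:
    """Mark chosen numeric columns as categorical so dtype-based (feature_kind)
    detection finds them (sketch, not an sktime transformer)."""

    def __init__(self, columns):
        self.columns = columns

    def fit_transform(self, X):
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].astype("category")
        return X

X = pd.DataFrame({"dow": [0, 1, 2], "sales": [5.0, 6.0, 7.0]})
Xt = ToCategorical(columns=["dow"]).fit_transform(X)
```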
Member

I'm quite confused about design 3. L188 and L204 mention that the user is required to own the encoding, but L195 and L203 suggest we are encoding with a fixed default encoding method. Can you please help me understand what exactly the plan is?

Author

@yarnabrina the encoding the user needs to do here is from 'numerical' to 'categorical', for the cases where they want to treat a numerical column as a categorical feature (because we cannot detect these by feature_kind). Internal encoding is from categorical to numerical and is done by a fixed default. Is it clear now?

- This is basically trying to provide the entire encoding ecosystem internally.
- Will be very difficult to implement. Where will the user specify all this data?

- I don't think this is a good idea.
Member

Thanks for sharing your opinion, but just to understand better, can you please tell whether you think it's a bad idea or a complex idea? These are not necessarily the same.

If it's the first one, viz. a bad idea, can I request you to explain why? I've shared my reasoning with respect to the multi-step orchestration process above; I would like to know yours.

Contributor

For me, I do not understand what concretely design 4 is, so I cannot form an opinion due to lack of specificity and information.

What would help me a lot is if, in all cases, the scenario "data has categorical, estimator does not support" is concretely spelled out. What does the speculative code look like, and what happens internally?

Author

@fkiraly I have mentioned (L-210) that it is the same as Design 2 or 3, but we are extending it by trying to provide more encoding options to the user.
So in the scenario "data has categorical, estimator does not support", it will encode the categorical features and fit.
I haven't gone into more detail of how it will look yet as I was not sure about how we would allow the user to specify all this data.

Author

@Abhay-Lejith Abhay-Lejith Jun 22, 2024

@yarnabrina , sorry about bluntly saying it's a bad idea without any explanation.
But my thought was that doing so much only to provide an existing functionality under the hood seemed unnecessary to me. Also, I did not think of the set_config idea you proposed, so I was also wondering how we would let the user specify so much data.
Edit: I was thinking ColumnTransformer, but I see that sktime has its own version as ColumnEnsembleTransformer (as noted by @fkiraly as well).


### Design 2. Perform one-hot encoding if not natively supported.

- `categorical_feature` param will be present in all models.
Member

@Abhay-Lejith @fkiraly any thoughts about this comment?

Member

Based on my understanding, this is the case:

  • design 1 requires users to specify which features are categorical.
  • design 3 requires users to encode externally.

So, a general question: is design 2 the only design to make use of feature kind PR?

Author

I have clarified about design 3 in #36 (comment). Please check it out.

Comment on lines 10 to 11
- error in fit saying not natively supported.
- internally encode and fit successfully.
Member

Besides L10 and L11, is the following not considered a valid option?

"Fit/predict by dropping/ignoring non-numerical features"

It's probably more useful than complete failure, and can possibly be combined with other designs, especially design 1, to increase the scope to all forecasters.

I did comment this earlier as well, don't know what you think about this @Abhay-Lejith @fkiraly .
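The "warn and drop" option being discussed could look roughly like this (a sketch; the helper name is illustrative, not sktime API):

```python
import warnings
import pandas as pd

def drop_categorical_with_warning(X):
    """Remove categorical columns with a warning instead of failing outright
    (sketch of the 'warn and drop' option)."""
    cat_cols = [c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)]
    if cat_cols:
        warnings.warn(f"Estimator lacks categorical support; dropping columns {cat_cols}.")
    return X.drop(columns=cat_cols)

X = pd.DataFrame({"region": pd.Categorical(["n", "s"]), "temp": [20.0, 25.0]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    X_numeric = drop_categorical_with_warning(X)
```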

Author

Yes it is a possible solution. I am updating the design document once again and will add this.

---

### Design 1. Allow categorical features ONLY in those forecasters which support them natively (no encoding internally)
Member

In design 1, is the idea to create XGBoostForecaster etc. similar to LightGBM in Darts, basically as a wrapper on top of make_reduction?

  • If yes
    • how do you propose to support for these specific regressors through reduction forecasters from skforecast and darts?
  • If no
    • can you please give more details?

Author

There are essentially two options: create this wrapper on top of make_reduction itself or modify make_reduction to work with categorical. I asked for your inputs on the wrapper idea in the umbrella issue and both of you were against it. So for now the design here is on modifying make_reduction to support it based on finalised solutions to req3 and req4.
I am also thinking about the wrapper class idea in a bit more detail and will try to add it here as well.

- Which columns to encode?
- Which encoder to use on which column?
- This is basically trying to provide the entire encoding ecosystem internally.
- Will be very difficult to implement. Where will the user specify all this data?
Member

Users can pass configuration for the forecaster to specify encoding choices, with a specific encoding method (say, one-hot encoding) being the default. It'll look like forecaster.set_config(encoding_configuration=...).

The values can be of 2 types:

  1. single encoder instance -> use feature kind to identify categorical features
  2. dictionary with column names as keys and value as encoder instance

In my point of view, the pros are that this will not need to modify signature of any forecasters, and users have full freedom for encoding choices if they so want. This can also be set on a pipeline instance and can be forwarded to all estimators in the pipeline possibly to ensure all of them use same encoding method (preferably same encoder instance).

Requesting @Abhay-Lejith and @fkiraly to suggest cons.
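The two value types described above could be resolved into a per-column mapping roughly as follows (a sketch; `encoding_configuration` is the hypothetical name proposed in this comment, and the resolver is illustrative, not sktime API):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

def resolve_encoding_config(X, config):
    """Resolve the proposed set_config value: a single encoder instance applies
    to all dtype-detected categorical columns; a dict gives per-column choices."""
    if isinstance(config, dict):
        return dict(config)
    cat_cols = [c for c in X.columns if isinstance(X[c].dtype, pd.CategoricalDtype)]
    return {c: config for c in cat_cols}  # same instance shared across columns

X = pd.DataFrame({"size": pd.Categorical(["S", "M"]), "price": [1.0, 2.0]})
by_detection = resolve_encoding_config(X, OneHotEncoder())           # type 1
per_column = resolve_encoding_config(X, {"size": OrdinalEncoder()})  # type 2
```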

Contributor

@fkiraly fkiraly Jun 22, 2024

My problem with this is that if we use set_config, the modelling choices here will not be tunable, or visible via set_params, which is the "big advantage" of the sklearn-like interface.

I would feel strongly that these choices - as they are modelling choices - should sit somewhere where get_params, set_params see them.

The most natural place would be existing pipelines, and I am noting that the above seems to replicate the functionality of ColumnEnsembleTransformer.

Member

A few clarification questions:

  1. I am definitely unfamiliar with ColumnEnsembleTransformer in sktime; how is it different from ColumnTransformer (from sktime or from sklearn)?

  2. Can you please elaborate how this transformer can be used for the orchestration requirement I mentioned earlier in Discord (use encoding with estimator-specific scope, so that we only do encoding where required and use native support wherever available)? Of course, "we are not giving that option to the user" is also an option; is that the plan?

  3. If you are suggesting that the encoding method (or methods, if we allow variation per column) needs to be tunable, which does make sense if we go by the pipeline approach, I think designs 2-3 are not suitable because they rely on a fixed encoder, and 1 is kind-of passable as it does not do encoding at all and relies on whatever encoding estimators do natively (@Abhay-Lejith please confirm).

Author

@Abhay-Lejith Abhay-Lejith Jun 23, 2024

About point 3, yes you are right.
I did not have in mind this 'tunable' requirement, until @fkiraly brought it up as a con to the set_config idea. So the fixed default encoder designs will not be tunable since it is a 'fixed' default.

About point 2, this case of a pipeline where one estimator may have native support and another may not, and we wish to use the encoded features in one but the native support in the other, seems like a rather niche case to me. Why not just use the encoded features for all estimators in the pipeline? Also, just so I am clear on it, could you give an example of such a pipeline with code?

Contributor

@fkiraly fkiraly left a comment

Short thought: at different stages, we are talking about one-hot encoding or dropping.

I think this can be abstracted as a discussion of default or option for transformer in the "no categorical support" / "categorical features present" case.

There are two transformers explicitly mentioned:

  • "drop categorical columns"
  • "encode cateogrical columns with one-hot-encoder"

both transformers share the property that they remove categorical columns and leave only numerical columns.
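As a concrete illustration of that shared property, here is a minimal pandas-only sketch of the two options (column names are hypothetical, not from the proposal):

```python
import pandas as pd

# exogenous data with one categorical column (names are hypothetical)
X = pd.DataFrame({
    "temp": [20.1, 21.3, 19.8],
    "weather": ["sun", "rain", "rain"],
})
cat_cols = X.select_dtypes(include=["object", "category"]).columns

# option 1: drop categorical columns
X_dropped = X.drop(columns=cat_cols)

# option 2: one-hot encode categorical columns
X_encoded = pd.get_dummies(X, columns=list(cat_cols))

# either way, the result contains only non-categorical columns
```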

@Abhay-Lejith
Author

@fkiraly @yarnabrina I have restructured the proposal and solution designs based on the "data has categorical, model doesn't support" case as discussed on Discord, taking into account the other points and suggestions.

## About reduction case:

### Models which support categorical natively that have been identified so far:
- xgboost
Contributor

@fkiraly fkiraly Jun 27, 2024

these are tabular regression models, not models in the scope of sktime. Please discuss models within the scope of sktime, even if abstract or speculative.

Author

Yes, I have mentioned just below that we will have to use reduction for these.
I did not find any sktime forecasters that directly take categorical variables in exog.


## Solutions for some of the requirements:

### Requirement 1
Contributor

There are no requirements specified above, so it is a bit vague what this is solving.

Author

@fkiraly I have listed requirements above in L-41 to L-52.

Contributor

@fkiraly fkiraly left a comment

Thanks!

This is going in the right direction. I think, though, we need to really separate out the different decisions in here; currently, different decisions are mixed inside the different designs.

I can spot the following decisions:

  • how the information about which columns are categorical reaches the forecaster boilerplate, and the forecaster internals.
    • there might be only a single option considered (via datatypes check)
  • how the information about which numerical columns should be considered categorical is entered by the user
    • options I see in the document: __init__ arg of the forecaster; argument of a component; set_config; not explicitly specified but done by pipelining with a transformer
    • sub-question: do we require all forecasters to have a cat_features argument or not. If yes, what does it do?
  • if the user wants to specify an encoder, how is this done? This is a separate decision, and could be solved in multiple ways, e.g., __init__ param, set_config, done via pipeline, manual encoding, etc. Some designs may preclude one or multiple solutions here.

@fkiraly
Contributor

fkiraly commented Jun 28, 2024

some unordered notes from today's meetup, by @yarnabrina and myself:

AR, FK
discussing categorical feature support designs

use case: pipeline with encoder

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

pipe = Encoder(...) ** Forecaster(...)
...  # -> only needs to allow mtypes to support non-num data; everything else is the user's responsibility to handle categorical vars
```

FK: morally agree

There is the issue of broadcasting, if encoding is not stable wrt order of occurrence of categories, for instance. Encoder might be broadcast across instances with unstable encoding.

Example:

panel with two series
series 1: "A" "B" "C" "A" "B" "C"
series 2: "B" "C" "A" "B" "C" "A"

if encoder maps categories in order of occurrence on 1, 2, 3, then applying by series will result in encoding 123123 for both, but it should be 123123 and 231231

Problem happens if we broadcast! E.g., encoder is "local", Forecaster is "global/hierarchical"
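The instability can be demonstrated with pandas' factorize, which (like some encoders) assigns codes in order of first occurrence. This is a hypothetical illustration, not sktime code:

```python
import pandas as pd

s1 = pd.Series(["A", "B", "C", "A", "B", "C"])
s2 = pd.Series(["B", "C", "A", "B", "C", "A"])

# broadcast / "local" encoding: each series encoded on its own,
# codes assigned in order of first occurrence
codes1_local, _ = pd.factorize(s1)
codes2_local, _ = pd.factorize(s2)
# both come out as [0, 1, 2, 0, 1, 2] -- the category-to-code mapping differs!

# "global" encoding: one shared mapping, fitted once on series 1
mapping = {cat: code for code, cat in enumerate(pd.factorize(s1)[1])}
codes2_global = s2.map(mapping)
# series 2 now consistently encodes as [1, 2, 0, 1, 2, 0]
```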

use case: custom encoder specification

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

pipe = Forecaster(...)

pipe.set_config(cat_features={"var1": OneHotEncoder(), "var2": FeatureBinner()})
```

FK comments:

  1. this is not tunable; parameter choices are not exposed to the user. Feels strongly that modelling choices should be exposed via the dedicated interface point, set_params / get_params
  2. config fields also do not support estimators (afaik); this can lead to problems, e.g., estimators are not cloned if passed to config.
  3. functionally, this is the same as ColumnEnsembleTransformer, or a pipeline with FeatureUnion

With ColumnEnsembleTransformer, it would look like this:

```python
from <library> import <Encoder>
from sktime.<...> import <Forecaster>

col_tr = ColumnEnsembleTransformer(
    [
        ("onehot", OneHotEncoder(), "var1"),
        ("bin", FeatureBinner(), "var2"),
    ]
)
pipe = col_tr ** Forecaster(...)
```

FK: this looks like the same amount of complexity in specification, and the advantage is that it does not introduce separate, special syntax to cover the case.

AR: looks ok for design 3
would like to have orchestration feature in-between

```python
pipe = est1 * est2 * est3
# est1, est3 -> do not support categorical variables
# est2 does support categorical variables
# I wish to use native categorical support for est2 and use encoding for est1 and est3
# building the pipeline with an encoder first will not allow that

# pipe = (enc * est1 * enc_rev) * est2 * (enc * est3 * enc_rev)

# y, X = ...
# X has a few cat features, unknown for the future. It also has a few cat features, known for the future.
# predict unknown cat features first (similar to ForecastX)
```

AR summary

  1. user specifies categorical features (optionally).
  2. sktime auto-determines categorical features using feature kind.
  3. in case of direct use of a forecaster, convert the forecaster to ColumnEnsembleTransformer() ** forecaster -> uses a specific encoder by default (e.g. one-hot)
  4. else the user creates that pipeline themselves

FK: this looks like design 2b

  • default encoding via OneHotEncoder in yes/no case
  • user uses ColumnEnsembleTransformer in specification case

for reference, we should probably consider clear names:

  • "yes/no case" - estimator does not support categorical data, but is passed
  • "encoder override" - user wants to select a specific different encoder for a packaged estimator that has a hard-coded internal encoder
  • "custom specification" - user wants to specify different encoders for different variables/columns

In design 2b, these are covered as follows:

  • "yes/no case" - OneHotEncoder is automatically applied, internally
  • "encoder override" - user pipelines MyEncoder() ** Forecaster()
  • "custom specification" - user pipelines ColumnEnsembleTransformer(stuff) ** Forecaster()

AR question: what are the changes from current to 2b above?
I.e., the difference between current main and the above target (2b), in terms of "todos", PRs, etc.

FK thinks:

  • enabling categorical in mtypes
  • feature_kind mechanism
  • in specific models that support categorical, use feature_kind to pass to the internal cat_variables; do not expose at __init__
  • ensure that categorical encoders work in TabularToSeriesAdaptor, or similar, and that they lead to consistent encoding in case of vectorization. I.e., ensure that OneHotEncoder() ** forecaster has all desired properties for an existing forecaster (no categorical variables supported)
    • sub-task: ensure that sklearn encoders play nicely with compositors like ColumnEnsembleTransformer

AR query: if user wants to treat a specific numeric variable (say ratings 1-5 scale) as categorical, how to specify?

Q: as binary multioutput?
In design 2b, it would be

```python
ColumnEnsembleTransformer([("onehot", OneHotEncoder(), "varname_to_encode")]) ** forecaster
```

or simply OneHotEncoder() ** forecaster if it is the only exogenous variable

Or, as a single categorical (multiclass, single output), it could be something like

```python
Categorizer("varname_to_encode") ** forecaster
```
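If categorical columns end up being auto-detected by dtype (the feature_kind mechanism), another route, sketched here with plain pandas and hypothetical column names, would be to cast the numeric ratings column to a categorical dtype before passing it:

```python
import pandas as pd

# hypothetical exogenous data with a 1-5 ratings column stored as integers
X = pd.DataFrame({
    "rating": [4, 5, 3, 4, 1],
    "price": [9.5, 12.0, 8.0, 10.0, 7.5],
})

# cast to pandas categorical dtype, so dtype-based detection
# would classify the column as categorical
X["rating"] = X["rating"].astype("category")
```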

Status: PR in progress
3 participants