
WIP - Design document for polars support in skpro #34

Open · wants to merge 21 commits into base: main
Conversation

@julian-fong

Opening a PR to discuss various methods and ideas for implementing polars support for estimators in skpro. All design-related information will be consolidated in this PR.

Open to any contributions or ideas!

@julian-fong changed the title from "initial commit" to "WIP - Design document for polars support in skpro" on May 31, 2024
@julian-fong (Author)

Linking sktime/skpro#342 as the coordination PR for the implementation of these ideas.

@julian-fong (Author)

@fkiraly I have written down notes for a few of the sections (see sections 2 and 3) and a couple of discussion items (see section 9). Discussions can happen inside the md, or can stay in the comment sections here so that pings work, etc.

@fkiraly (Contributor) left a comment

Replies to the questions below.

Question 2.1: yes, I anticipate that for some interfaced estimators there will never be internal polars support, e.g., the statsmodels based ones. On the other hand, the framework should support internal polars in native implementations, for direct interfaces to packages that support or will support polars, or for mixtures like composites where part of the logic may be in skpro and part may be external.

Question 2.2: what do you mean by this? Every estimator should accept all data containers, if the boilerplate layer works, and which types are supported internally can be seen in the tags X_inner_mtype and y_inner_mtype.

Discussion items:

Item 1 idea 1 - sounds good, do you have a link or ref on how index is handled?

Item 2 idea 1 - I would strongly recommend not to focus on support for polars.Series. This is based on experience in sktime where supporting both Series and DataFrame in pandas has caused maintenance and development overhead that was out of proportion compared to the user benefit it brought (and one might argue it's actually net negative due to user confusion)

Discussion item 2: I would suggest avoiding predict_proba for the moment, as that would open up the can of worms of supporting polars internally in distributions, most of which are internally numpy and scipy based.

Discussion item 3: my (not very strong) opinion is to use set_config as a backend point, and possibly provide syntactic sugar through set_output.
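To make the set_config / set_output split concrete, a minimal sketch of the user-facing side, assuming an illustrative import path and a "transform" config key (set_output itself does not exist yet and is only the proposed sugar):

```python
from skpro.regression.linear import GLMRegressor  # import path assumed for illustration

reg = GLMRegressor()

# backend point: a plain estimator config flag controlling the output container
# (the key "transform" and the value "polars" are assumptions, not settled API)
reg.set_config(**{"transform": "polars"})

# syntactic sugar: a proposed sklearn-style alias that would forward to set_config
# reg.set_output(transform="polars")  # not yet implemented; part of the proposal
```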

@julian-fong (Author)

Question 2.2: what do you mean by this? Every estimator should accept all data containers, if the boilerplate layer works, and which types are supported internally can be seen in the tags X_inner_mtype and y_inner_mtype.

Apologies for the poorly worded question. Right now every estimator has the tags X_inner_mtype and y_inner_mtype with pandas DataFrame tables - and we know there are two sets: one set of estimators for which polars is supported, and another set where polars is not supported, e.g., statsmodels-based ones. My question is whether the idea here is to individually check each interfaced estimator to see if it supports polars DataFrames, and update the tags to include polars_eager_table if it belongs to the first set.

Discussion items:

Item 1 idea 1 - sounds good, do you have a link or ref on how index is handled?

The link to the polars function from_pandas is here: https://docs.pola.rs/py-polars/html/reference/api/polars.from_pandas.html . The bool parameter include_index includes any index from the pandas DataFrame as a column. From what I've tested using some dummy data, the functionality is as follows:

  • If include_index is True and the index in the pandas DataFrame is a single-level index that is unnamed, then from_pandas returns a polars DataFrame with a column named "None" containing the corresponding indices.
  • If include_index is True and the index in the pandas DataFrame is a multi-level index for which all index levels are named, then from_pandas returns a polars DataFrame with one column per index level, named after the level and containing the corresponding indices.
  • If include_index is True and the index in the pandas DataFrame is a multi-level index for which some of the index levels (2 or more) are unnamed, I believe it throws an error.

We can probably leverage this as a potential second conversion function to handle these scenarios, and pass the index back to users if they request it via our own bool parameter return_index.
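To make the include_index behaviour above concrete, a small sketch using dummy data (the "None" column name for an unnamed index is as observed above and may vary by polars version):

```python
import pandas as pd
import polars as pl

# unnamed single-level index: carried over as a column (reportedly named "None")
pdf = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])
print(pl.from_pandas(pdf, include_index=True).columns)

# fully named multi-level index: one column per index level, then the data columns
idx = pd.MultiIndex.from_tuples([("x", 1), ("x", 2), ("y", 1)], names=["group", "step"])
pdf2 = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
print(pl.from_pandas(pdf2, include_index=True).columns)  # ['group', 'step', 'a']

# partially unnamed multi-level index: reportedly raises an error
```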

@julian-fong (Author)

Discussion item 2: I would suggest avoiding predict_proba for the moment, as that would open up the can of worms of supporting polars internally in distributions, most of which are internally numpy and scipy based.

Sounds good, then in that case I'll leave the corresponding sections in #TODO for now

@julian-fong (Author)

If section 3 is sufficiently written, I can go ahead and start building a PR to implement these changes. If there are more areas where you think we need to implement polars support, I am happy to continue adding to that section. I mainly used the BaseProbaRegressor class and the regression extension template to consolidate the enhancements I believe need to be made.

@fkiraly (Contributor)

fkiraly commented Jun 5, 2024

If section 3 is sufficiently written - I can go ahead and start building a PR to implement these changes

I'm not sure if you are referring to the current state or have some unpushed changes. The document does not say what the target state is. You are showcasing some column index handling, but it would be great if you could very explicitly discuss:

  • proposed signature of predict_interval, predict_quantiles, predict_var
  • how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)
  • in the target state, the relevant cases and flow logic internally if polars or pandas is passed. We have round trip conversion, or passing unconverted internally.
  • where the converters or logic will be located, in relation to predict_interval etc.

Other question, is the tests PR ready?

@julian-fong (Author)

Other question, is the tests PR ready?

Yes - I would like a review to see where changes are needed for the current tests I wrote, and as I implement new changes I will add more tests inside the test file as well

@julian-fong (Author)

julian-fong commented Jun 10, 2024

I've added another item (see discussion item 4) and am working to incorporate target states for predict methods. @fkiraly does the one I wrote so far for predict_interval suffice?

If section 3 is sufficiently written - I can go ahead and start building a PR to implement these changes

I'm not sure if you are referring to the current state or have some unpushed changes. The document does not say what the target state is. You are showcasing some column index handling, but it would be great if you could very explicitly discuss:

  • proposed signature of predict_interval, predict_quantiles, predict_var

See above

  • how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)

Still thinking about this item

edit: see below

  • in the target state, the relevant cases and flow logic internally if polars or pandas is passed. We have round trip conversion, or passing unconverted internally.

Still working on the planning stages for this, maybe discussion on item 4 would help figure this one out?

@julian-fong (Author)

  • how you are planning to handle predict_proba - leaving it unchanged for now is where we arrived, but that may cause some unpleasant case handling in the default implementation of predict_quantiles (although it might be caught by the boilerplate)

I believe that as long as pred_int is returned in pandas DataFrame format, the default method _predict_quantiles shouldn't matter. My aim when altering the code base for _base.py is to change how predict_interval, predict_proba, predict_var, and predict_quantiles handle the data containers; the corresponding default private methods shouldn't be touched, imo.
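To illustrate the split being proposed, a toy sketch with the conversion logic inlined in the public method (hypothetical helpers, not skpro's actual boilerplate):

```python
import pandas as pd
import polars as pl

class _BoilerplateSketch:
    """Public method handles containers; the private default stays pandas-only."""

    def predict_interval(self, X, coverage=0.9):
        # input handling: coerce polars input to the pandas inner representation
        X_inner = X.to_pandas() if isinstance(X, pl.DataFrame) else X
        pred_int = self._predict_interval(X_inner, coverage)  # untouched default
        # output handling: hand back the container matching the input (or set_output)
        return pl.from_pandas(pred_int) if isinstance(X, pl.DataFrame) else pred_int

    def _predict_interval(self, X, coverage):
        # stand-in for the default implementation: pandas in, pandas out
        half_width = 1.0
        return pd.DataFrame(
            {"lower": X.iloc[:, 0] - half_width, "upper": X.iloc[:, 0] + half_width}
        )
```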

@fkiraly (Contributor)

fkiraly commented Jun 13, 2024

the default corresponding private methods shouldn't be touched imo.

I see, that's a good solution, and it's good that the defaults are in the private methods, so it works.

@fkiraly (Contributor)

fkiraly commented Jun 13, 2024

Still working on the planning stages for this, maybe discussion on item 4 would help figure this one out?

For the contributor workflow, in fact both modes are supported, plus an unmentioned third which allows the user to implement only a polars option (in this case polars becomes a softdep of the estimator).

The way this works is via the conversion logic described here:
https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.x_inner_mtype.html#sktime.registry._tags.x_inner_mtype

If only one mtype is on the list, any input is converted to it internally, and converted back at output. If multiple mtypes are on the list, any mtype on the list is passed on unconverted (but possibly after minor coercions); in this case, it is expected that the internal code can handle both types. This is typically a much larger cognitive burden on the implementer though, as there is no standard dispatch mechanism, so they have to do if/else or dispatch to private methods.
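For illustration, a class skeleton showing how the tag-driven behaviour above might be declared; "polars_eager_table" is the mtype discussed in this thread, and the pandas table mtype string is an assumption:

```python
from skpro.regression.base import BaseProbaRegressor

class PolarsAwareRegressor(BaseProbaRegressor):
    _tags = {
        # single entry: the boilerplate converts any input to this mtype and
        # converts outputs back (round trip), so _fit/_predict only ever see pandas
        # "X_inner_mtype": "pd_DataFrame_Table",
        #
        # multiple entries: matching inputs are passed through unconverted,
        # so the private methods must handle both containers themselves
        "X_inner_mtype": ["pd_DataFrame_Table", "polars_eager_table"],
        "y_inner_mtype": ["pd_DataFrame_Table", "polars_eager_table"],
    }
```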

@julian-fong (Author)

@fkiraly I've made some additions to section 3 detailing how the predict_* functions will incorporate set_output-like logic, and have written down the technical implementation of how polars frames would look in various predict_* methods.

Upper/lower interval ends are equivalent to quantile predictions at alpha = 0.5 - c/2, 0.5 + c/2 for c in coverage.

For pl.DataFrame: columns are in a single level, following a similar convention to pd.DataFrame
Contributor:

that would work - although in the pd.DataFrame convention the second level is float. Do we foresee any issues in coercing everything to string? This no longer allows, for instance, distinguishing variable names that are strings from those that are integers. I suppose the variable name is only an issue with mixed input/output types; if they are the same, then it would always be str.

Author:

I don't think column names in polars are allowed to be anything but strings. I've tried setting 0 or 0.9 as a column name, but I got an error: TypeError: Series name must be a string
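To make that concrete, one purely illustrative way the pandas MultiIndex interval columns could be flattened into polars string names (the separator and ordering here are assumptions, not an agreed convention):

```python
import polars as pl

# for coverage c = 0.9 the interval ends correspond to the 0.05 and 0.95 quantiles
# (alpha = 0.5 - c/2 and 0.5 + c/2); polars requires the column names to be strings
pred_int = pl.DataFrame(
    {
        "y__0.9__lower": [1.2, 1.4],
        "y__0.9__upper": [2.3, 2.6],
    }
)
print(pred_int.columns)

# a non-string column name is rejected, e.g.:
# pl.DataFrame({0.9: [1.2, 1.4]})  # raises TypeError
```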

@fkiraly (Contributor)

fkiraly commented Jun 25, 2024

Opinions about the questions:

Question 3.1) Is this already handled, in the scenario where a polars DataFrame is passed and skpro automatically converts it into a pandas DataFrame to calculate the predictions?

Yes, I believe it should be - are you testing it? If not, let's.

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

Question 3.2) What should we do with the predict function? Currently it automatically converts whatever is passed into the predict function back into the mtype that was seen in fit. Do we need to refactor this as well?

I think the default behaviour should be the round trip, yes.

However, I think if the user uses set_output, it should be possible to override this - if not too complicated. It may be as simple as changing the to_type in the output conversion to the transform string.

@julian-fong (Author)

julian-fong commented Jun 26, 2024

Opinions about the questions:

Question 3.1) Is this already handled, in the scenario where a polars DataFrame is passed and skpro automatically converts it into a pandas DataFrame to calculate the predictions?

Yes, I believe it should be - are you testing it? If not, let's.

When I first began working on this, I tried passing in a polars DataFrame with the GLMRegressor and the library statsmodels complained that it does not support polars DataFrames. Hmm, but I guess that was the fit function and not the predict function.

Maybe another discussion item on how the interaction between fit and predict should work with polars DataFrames?

@julian-fong (Author)

julian-fong commented Jun 26, 2024

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

Correct - this falls under the 'polars to pandas' set of combinations. We can handle this when creating the datatype/adapters module inside #392. My current proposal is that there should be custom conversion methods from polars to pandas (and vice versa) via the create_container method. The create_container method will take in X_input and the user-given transform, and make the corresponding conversions.

For example's sake, let's consider polars, and say there are a total of 3 datatype classes created using #392's design: ["pandas", "polars", "numpy"]. Then in the polars datatype/adapter class, there will be a create_container method that takes in X_input (ideally, X_input should fall into one of those three categories). The create_container method should then identify what mtype X_input is and use the correct conversion method (either numpy to polars or pandas to polars) to generate the correct output.

The polars to pandas conversion should be simple, as the table design of the polars DataFrame is quite trivial (fingers crossed). Just converting it to a pandas DataFrame should be enough.
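A minimal sketch of the create_container idea described above; the dispatch shown is an assumption about how the #392 design might look, not final API:

```python
import numpy as np
import pandas as pd
import polars as pl

class PolarsAdapter:
    """Produces polars output containers from any supported input container."""

    def create_container(self, X_input):
        # identify the incoming container and apply the matching conversion
        if isinstance(X_input, pl.DataFrame):
            return X_input                   # already polars, pass through
        if isinstance(X_input, pd.DataFrame):
            return pl.from_pandas(X_input)   # pandas -> polars
        if isinstance(X_input, np.ndarray):
            return pl.DataFrame(X_input)     # numpy -> polars
        raise TypeError(f"unsupported input container: {type(X_input)}")
```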

edit: I just re-read this question and realized that I may have mis-interpreted the question.

Another related comment: what happens if the internal estimator produces predict outputs in polars format, but the user wants pandas.DataFrame? We would then need to convert the format the other way round.

The set_output would still theoretically handle this scenario as per the above, but I assume this would be a rare case, since the x_inner_mtype would have to be a polars Table, and the developer who created the interface for this estimator would have had to have written all the private methods using polars DataFrames? I am not sure.

@julian-fong (Author)

julian-fong commented Jun 26, 2024

Question 3.2) What should we do with the predict function? Currently it automatically converts whatever is passed into the predict function back into the mtype that was seen in fit. Do we need to refactor this as well?

I think the default behaviour should be the round trip, yes.

However, I think if the user uses set_output, it should be possible to override this - if not too complicated. It may be as simple as changing the to_type in the output conversion to the transform string.

Agreed - unless the user specifies what they want, it's best to keep the current convention. There is a method with a boolean check that I have designed to ensure that the user has specified a transform in set_output (see skpro #399, check_transform_config). If that boolean is True, then we can code it to override the current implementation. If not, then we proceed with the code as usual.
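A tiny runnable sketch of that boolean-gated override; the mtype strings and the shape of the check are illustrative assumptions (the real check would come from check_transform_config in #399):

```python
def resolve_output_mtype(input_mtype, transform_is_set, transform):
    """Round trip by default; a user-set transform (via set_output) overrides it."""
    if not transform_is_set:
        return input_mtype  # default behaviour: convert back to the input mtype
    # mtype strings assumed for illustration
    alias_to_mtype = {"pandas": "pd_DataFrame_Table", "polars": "polars_eager_table"}
    return alias_to_mtype[transform]

# a polars input with no set_output request round-trips back to polars:
print(resolve_output_mtype("polars_eager_table", False, None))
# ...while set_output(transform="pandas") overrides the round trip:
print(resolve_output_mtype("polars_eager_table", True, "pandas"))
```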

@julian-fong (Author)

@fkiraly I've added questions 3.3 and 3.4, and made some minor updates to section 3 as well


~~Question 3.2~~) What should we do with the `predict` function? Currently it automatically converts whatever is passed into the `predict` function back into the mtype that was seen in fit. Do we need to refactor this as well?

Question 3.3) Currently there is no way for the user to specify whether or not they want the index passed back when converting from pandas to polars. The user can currently only specify what transform they want via `set_output`; there is no other parameter they can set to specify what configuration they want, as the internal code handles the rest of the conversion. How should we tackle this problem?
Contributor:

indeed, there are no "options" in back-conversions, and one particular pathway is assumed.

If information needs to be remembered, it is via the converter_store argument that all conversion functions have.

I'm not sure whether this is a problem that needs tackling, can you explain some different use cases, and why a user may want this?
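As an aside, a toy illustration of the converter_store idea mentioned above, using plain helper functions rather than skpro's actual converters:

```python
import pandas as pd
import polars as pl

def pandas_to_polars(df: pd.DataFrame, store: dict) -> pl.DataFrame:
    # the forward conversion stashes what would otherwise be lost (here, the index)
    store["index"] = df.index
    return pl.from_pandas(df)

def polars_to_pandas(df: pl.DataFrame, store: dict) -> pd.DataFrame:
    # the back-conversion reads the store to restore the original container faithfully
    out = df.to_pandas()
    out.index = store.get("index", out.index)
    return out

store = {}
pdf = pd.DataFrame({"a": [1.0, 2.0]}, index=[10, 20])
roundtrip = polars_to_pandas(pandas_to_polars(pdf, store), store)
assert roundtrip.equals(pdf)
```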

Author:

It goes back to Discussion Item 0 earlier: potentially giving the user the option to have the index returned via a column named __index__. That way they can keep track of what indices were used during testing.

@julian-fong (Author) · Jul 2, 2024

There currently is no way for the user to specify whether they want to include the index or not when converting the DataFrame from pandas to polars.


Question 3.3) Currently there is no way for the user to specify whether or not they want the index passed back when converting from pandas to polars. The user can currently only specify what transform they want via `set_output`; there is no other parameter they can set to specify what configuration they want, as the internal code handles the rest of the conversion. How should we tackle this problem?

Question 3.4) Is there a way to check, via the tags or configs, which data container the interface is developed in, with regard to the private `_predict_*` methods? As an example, if the `GLMRegressor` had a tag 'coded_in_pandas', we would know that all the private `_predict_*` functions are written using pandas.
Contributor:

yes, the tag is X_inner_mtype or y_inner_mtype, see https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.registry._tags.x_inner_mtype.html

Currently, most implementations assume that there is only a single internal type, or multiple related ones (e.g., multiple scitypes but all pandas based).

Compositors, however, may assume they support all mtypes, if they only carry out abstract operations such as passing data on.
