Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] refactor datatypes mtypes - checkers, converters #392

Merged
merged 45 commits into from
Sep 7, 2024
Merged

Conversation

fkiraly
Copy link
Collaborator

@fkiraly fkiraly commented Jun 16, 2024

This PR refactors the data type specifications and converters to classes.

Implements sktime/sktime#3512, related to sktime/sktime#2957.

Contains:

  • a base class for mtypes, BaseDatatype, to replace the more ad-hoc dictionary design
  • a complete refactor of the Table mtype submodule to this interface
  • a base class for converters, BaseConverter, also replacing a dictionary based design
  • a partial refactor of the Table related converters to this interface
  • a full refactor of the public framework module with check and convert logic, in datatypes, to allow extensibility with this design

Partial mirror in skpro of sktime/sktime#6033

@fkiraly fkiraly added the module:datatypes datatypes module: data containers, checkers & converters label Jun 16, 2024
@fkiraly fkiraly self-assigned this Jun 17, 2024
@julian-fong
Copy link
Contributor

Question : does that mean something like pandas would have a BaseConverter and a BaseDatatype? What is the intuition for separating it into two different classes?

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jun 26, 2024

does that mean something like pandas would have a BaseConverter and a BaseDatatype?

Yes.

What is the intuition for separating it into two different classes?

The intuition is that conversions are indexed by ordered pairs of mtypes, and extensibility.

"ordered pairs"

If you want to put the conversions into classes, you have two "reasonable" options that I can see:

  • conversion goes into a class together with the check. In this case multiple conversions go into the "mtype" class.
  • conversion goes into its own class.

The first choice has some odd implications, e.g., do you put it in the "from" or "to" class? You also suddenly end up coupling dependencies, e.g., where do you put ta conversion between mtypes with two different soft dependencies? You now have code that uses soft dependency 2 in a class that has main soft dependency 1, etc. Sounds unpleasant and difficult to maintain to me.

"extensibility"

Not all conversions are directly implemented, as they grow combinatorially and "exotic" data types are rarely converted to each other. Sometimes developers add direct conversions without modifying the original mtypes. If a conversion is its own class, it is possible to add this without modifying anything else - like with the estimator classes, a new class is added, no class or register is modified.

But, please let me know if you have pertinent ideas, or see designs that I haven't.

@julian-fong
Copy link
Contributor

Right - so if each conversion is its own class, i.e from x to y, we don't clog up BaseDatatype x with various conversion methods from other different datatypes. That does make sense.

Follow up question: could you give an example of how a BaseConverter would look like? Would the naming convention be something like PandasPolarsConverter, if we're marking that it converts pandas to polars

@julian-fong
Copy link
Contributor

julian-fong commented Jun 26, 2024

As a follow up point, I looked at some of sklearn's set_output functionality on how the user given transform specification determines which data container to use. In sklearn, both polars and pandas 'adapter' classes (see https://github.com/scikit-learn/scikit-learn/blob/8d0b243cb53ff609d32ecd7aafc5c098381eac86/sklearn/utils/_set_output.py#L110) are created with specific create_container methods that allows automatically converts the input dataframe into the desired output mtype. We could potentially leverage the same idea and put that down as part of the BaseConverter depending on how we want to use that class.

So as part of the predict_* function, we would use which data container the user has specified via set_output and then retrieve the corresponding skpro 'adapter' and corresponding methods to convert it to the desired mtype.

I'll edit the design doc to highlight that soon

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jun 28, 2024

Follow up question: could you give an example of how a BaseConverter would look like? Would the naming convention be something like PandasPolarsConverter, if we're marking that it converts pandas to polars

fully worked examples can be found in sktime/sktime#6033

@fkiraly fkiraly marked this pull request as ready for review July 14, 2024 18:22
@fkiraly fkiraly changed the title [ENH] refactor datatypes mtypes [ENH] refactor datatypes mtypes - checkers Jul 14, 2024
@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 14, 2024

FYI @julian-fong, this should be ready now

@fkiraly fkiraly changed the title [ENH] refactor datatypes mtypes - checkers [ENH] refactor datatypes mtypes - checkers, converters Jul 25, 2024
@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 25, 2024

@julian-fong, the design for converters is implemented - and two classes, one which contains a single conversion, and one which contains multiple. Review appreciated, before moving to translating the entire module.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 25, 2024

Failure is due to inconsistent handling of soft dependencies in checkers and converter dictionaries - this needs to be looked at, likely arising from the TableIdentity, because it loops over all mtypes, even though those requiring soft dependencies.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 25, 2024

any idea why this is failing on 3.12?

@julian-fong
Copy link
Contributor

julian-fong commented Jul 25, 2024

any idea why this is failing on 3.12?

i checked the soft deps for 3.12[all-extras] i think its because pyarrow is missing on the list. Could be that?

Comment on lines +80 to +83
self_params = self.get_params()

need_check = [k for k in self_params if self_params[k] is not None]
self_dict = {k: self_params[k] for k in need_check}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question - what is this code used for exactly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a feature, where you instantiate the type with arguments. This defines a subtype, e.g., "numpy 2D without nans", and then checks against the subtype, rather than just the overall type.

@julian-fong
Copy link
Contributor

Would we need extension templates for users to contribute new BaseConverters and BaseTables?

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 25, 2024

Would we need extension templates for users to contribute new BaseConverters and BaseTables?

Yes, of course! Once the design is final, I'd say.

Also, in the final verison we should add precise specifications for each mtype in the docstring, and list them on an API reference page.

@julian-fong
Copy link
Contributor

julian-fong commented Jul 25, 2024

Would we need extension templates for users to contribute new BaseConverters and BaseTables?

Yes, of course! Once the design is final, I'd say.

Also, in the final verison we should add precise specifications for each mtype in the docstring, and list them on an API reference page.

Maybe we should include what types of specifications needs to be written for each mtype in a separate issue, along with any other new information that could be written inside the documentation of skpro

@fkiraly
Copy link
Collaborator Author

fkiraly commented Jul 25, 2024

agree! do you want to open it?

@fkiraly fkiraly merged commit 31da07d into main Sep 7, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:datatypes datatypes module: data containers, checkers & converters
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants