Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve DataFrame type after LazyFrame roundtrips #2862

Merged

Conversation

JakobGM
Copy link
Contributor

@JakobGM JakobGM commented Mar 9, 2022

Preserve inheritance on DataFrame.lazy().collect() roundtrips

The following test will currently fail:

import polars as pl

class MyDataFrame(pl.DataFrame):
    pass

assert isinstance(MyDataFrame().lazy().collect(), MyDataFrame)

The main reason is that DataFrame.lazy() is "forced" to create a new instance of LazyFrame which has no notion of which class or subclass of DataFrame that originally spawned it.

Solution

A solution to this issue is to store a pointer in the LazyFrame instance to the original DataFrame (sub)class that created it. It has the following definition:

  • LazyFrame._dataframe_class - A pointer to a DataFrame class or subclass which must be used whenever a LazyFrame must construct a new DataFrame instance.

By default, this class variable is hard-coded to DataFrame as a class property:

assert pl.LazyFrame()._dataframe_class is pl.DataFrame

But for any subclass of pl.DataFrame, for example MyDataFrame, must use a sub-class of pl.LazyFrame where LazyFrame._dataframe_class is set to MyDataFrame and not pl.DataFrame. For this reason, each class or subclass of pl.DataFrame has a class variable _lazyframe_class set to a class or subclass of pl.LazyFrame with a correct value for _lazyframe_class._dataframe_class.

  • DataFrame._lazyframe_class - A pointer to a LazyFrame class or subclass which must be used whenever a DataFrame must construct a new LazyFrame instance.

This is where the logic starts to become pretty circular, but take the following (now passing) test as an example:

assert pl.DataFrame._lazyframe_class is pl.LazyFrame


class MyDataFrame(pl.DataFrame):
    pass


assert MyDataFrame._lazyframe_class._dataframe_class is MyDataFrame
assert isinstance(MyDataFrame().lazy().collect(), MyDataFrame)

The end result is that DataFrame subclasses are able to spawn LazyFrame subclasses which are able to spawn the original DataFrame subclasses ♻️ This magic is achieved by specifying a custom metaclass for DataFrame as defined here. MyDataFrame automatically gets a custom subclass of LazyFrame named LazyMyDataFrame which is then stored on MyDataFrame._lazyframe_class.

Allowing end users to extend pl.LazyFrame

By default, if an end user extends the functionality of pl.DataFrame by inheriting from it, then a custom subclass of pl.LazyFrame is created. The only "task" of this subclass is to return the correct DataFrame type when casted back into a non-lazy representation. If an end user needs to extend the functionality of pl.LazyFrame in addition to pl.DataFrame, and must therefore connect these two classes together, it can be done in the following manner:

class MyLazyFrame(pl.LazyFrame):
    @property
    def _dataframe_class(self):
        return MyDataFrame

class MyDataFrame(pl.DataFrame):
    _lazyframe_class = MyLazyFrame

assert isinstance(MyDataFrame().lazy(), MyLazyFrame)
assert isinstance(MyDataFrame().lazy().collect(), MyDataFrame)

The reason for the @property is to work around the circular references between these two classes.

In the future, if polars would like to officially support extension through inheritance, these class variables _lazyframe_class and _dataframe_class can be renamed to lazyframe_class and dataframe_class in order to communicate that these variables are public API. Pydantic's Model Config class and Django's Meta class could be API patterns to replicate as well, moving these configuration variables into a nested configuration class.

Type annotations

Now that the type of MyDataFrame().lazy().collect() is correct, we must also provide enough information to type checkers so that they understand this reality:

reveal_type(MyDataFrame().lazy().collect())  # MyDataFrame, not DataFrame

In order to achieve this, we must be able to annotate that a given instance of LazyFrame is associated to a specific subclass of DataFrame. The solution to this problem is generics. In the same way that you can indicate that a list x contains integers by writing x: list[int], making the type checker understand that the return type of x.pop() is int, we should be able to annotate a lazy dataframe as ldf: LazyFrame[MyDataFrame] in order to specify that the return type of ldf.collect() is MyDataFrame and not DataFrame.

reveal_type(MyDataFrame().lazy())  # LazyFrame[MyDataFrame]

The solution is to let LazyFrame inherit from typing.Generic in the following way:

class LazyFrame(Generic[DF]):
    ...

    def collect(self, ...) -> DF:

And then type annotate DataFrame in the following way.

class DataFrame(...):
    ...

    def lazy(self: DF, ...) -> LazyFrame[DF]:
        ...

Restructuring of imports

Since how a custom subclass of LazyFrame must be created at import-time of DataFrame (see the implementation of DataFrameMetaClass), we must import polars.internals.lazy_frame.LazyFrame from polars.internals.frame. I have restructured the imports such that polars.internal imports all the lazy_frame symbols through frame instead.

@github-actions github-actions bot added the python Related to Python Polars label Mar 9, 2022
@JakobGM JakobGM force-pushed the preserve-subclasses-after-roundtrips branch 3 times, most recently from 2fb14b5 to c3c8719 Compare March 9, 2022 14:56
@JakobGM JakobGM marked this pull request as draft March 10, 2022 07:45
@JakobGM
Copy link
Contributor Author

JakobGM commented Mar 10, 2022

There is a problem with how I have combined @classmethod and @property, as that is not supported until python 3.9 (I was not aware). I will fix it when I have the time and convert this PR back from draft 😃

@JakobGM JakobGM force-pushed the preserve-subclasses-after-roundtrips branch 3 times, most recently from f4b89d9 to f05d2d3 Compare March 10, 2022 13:46
@JakobGM JakobGM marked this pull request as ready for review March 10, 2022 13:57
This method now preserves the type of self on roundtrips so we
no longer need to cast it to the correct type.
@JakobGM JakobGM force-pushed the preserve-subclasses-after-roundtrips branch from f05d2d3 to bb02cdf Compare March 10, 2022 14:01
@ritchie46
Copy link
Member

Again, an excellent write up. Thanks a lot. Especially the tricks to maintain typing information, well done. 👍

In the future, if polars would like to officially support extension through inheritance, these class variables _lazyframe_class and _dataframe_class can be renamed to lazyframe_class and dataframe_class in order to communicate that these variables are public API. Pydantic's Model Config class and Django's Meta class could be API patterns to replicate as well, moving these configuration variables into a nested configuration class.

What do you mean by this? Is there a difference between the inheritance functionality we are adding now and extension through inheritance?

@JakobGM
Copy link
Contributor Author

JakobGM commented Mar 10, 2022

Again, an excellent write up. Thanks a lot. Especially the tricks to maintain typing information, well done. 👍

Thanks 🙇

In the future, if polars would like to officially support extension through inheritance, these class variables _lazyframe_class and _dataframe_class can be renamed to lazyframe_class and dataframe_class in order to communicate that these variables are public API. Pydantic's Model Config class and Django's Meta class could be API patterns to replicate as well, moving these configuration variables into a nested configuration class.

What do you mean by this? Is there a difference between the inheritance functionality we are adding now and extension through inheritance?

No, that did not come out as clear as I wanted it to be 😅 The things I meant to say was the following:

A user of polars can now easily extend the functionality of DataFrame like so:

class ExtendedDataFrame(pl.DataFrame):
		# Extended functionality here...

But if a user would like to extend the functionality of DataFrame and LazyFrame it must be done by setting _lazyframe_class on ExtendedDataFrame like so:

class ExtendedLazyFrame(pl.LazyFrame):
	  @property
	  def _dataframe_class(self):
	          return ExtendedDataFrame


class ExtendedDataFrame(pl.DataFrame):
	  # Extended functionality here...
	  _lazyframe_class = ExtendedLazyFrame

The fact that both _dataframe_class and _lazyframe_class are prepended with underscores communicates to the end user that they are using private APIs and shouldn't really rely on the behavior. I.e. what we are communicating is that extending DataFrame is officially supported, but extending LazyFrame is not (since it requires the user to set private attributes). Nothing stops the user from doing so (it is Python, it allows you to do whatever you want 😅 ), but polars communicates that you should not do it. But I think it is OK for now, I will start to use this API a lot the next couple of months through work, so I guess I can be the beta tester of this functionality for now, and then when we feel that it works well, we can document it properly (and rename the class variables to something that does not have underscores prepended to them).

@ritchie46
Copy link
Member

Check.. Then I understand what you mean. :)

The fact that both _dataframe_class and _lazyframe_class are prepended with underscores communicates to the end user that they are using private APIs and shouldn't really rely on the behavior. I.e. what we are communicating is that extending DataFrame is officially supported, but extending LazyFrame is not (since it requires the user to set private attributes). Nothing stops the user from doing so (it is Python, it allows you to do whatever you want sweat_smile ), but polars communicates that you should not do it. But I think it is OK for now, I will start to use this API a lot the next couple of months through work, so I guess I can be the beta tester of this functionality for now, and then when we feel that it works well, we can document it properly (and rename the class variables to something that does not have underscores prepended to them).

Sounds good. I have no problem with supporting the LazyFrame <-> DataFrame combinations. So do your tests and then we come back on the public API later. 💯

I will merge this in. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants