Preserve DataFrame type after LazyFrame roundtrips #2862
There is a problem with how I have combined |
This method now preserves the type of self on roundtrips so we no longer need to cast it to the correct type.
Again, an excellent write up. Thanks a lot. Especially the tricks to maintain typing information, well done. 👍
What do you mean by this? Is there a difference between the inheritance functionality we are adding now and extension through inheritance? |
Thanks 🙇
No, that did not come out as clearly as I wanted 😅 What I meant to say was the following: a user of polars can now easily extend the functionality of `pl.DataFrame`:

    class ExtendedDataFrame(pl.DataFrame):
        # Extended functionality here...
        ...

But if a user would like to extend the functionality of `pl.LazyFrame` as well:

    class ExtendedLazyFrame(pl.LazyFrame):
        @property
        def _dataframe_class(self):
            # Resolved lazily, since ExtendedDataFrame is defined further down.
            return ExtendedDataFrame

    class ExtendedDataFrame(pl.DataFrame):
        # Extended functionality here...
        _lazyframe_class = ExtendedLazyFrame

The fact that both |
Check.. Then I understand what you mean. :)
Sounds good. I have no problem with supporting the I will merge this in. Thanks again! |
Preserve inheritance on `DataFrame.lazy().collect()` roundtrips

The following test will currently fail:
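A minimal sketch of the failure mode, using toy stand-in classes instead of the real polars ones (an assumption made so that the snippet runs without polars installed; `DataFrame`, `LazyFrame`, and `MyDataFrame` here are illustrative placeholders only):

```python
class DataFrame:
    """Toy stand-in for pl.DataFrame (hypothetical, not the polars class)."""

    def lazy(self) -> "LazyFrame":
        # The lazy frame is created without any reference to type(self).
        return LazyFrame()


class LazyFrame:
    """Toy stand-in for pl.LazyFrame (hypothetical, not the polars class)."""

    def collect(self) -> DataFrame:
        # With no record of the originating class, only the base class can be built.
        return DataFrame()


class MyDataFrame(DataFrame):
    pass


roundtrip = MyDataFrame().lazy().collect()
type_is_preserved = isinstance(roundtrip, MyDataFrame)
print(type(roundtrip).__name__, type_is_preserved)
```

In this toy model the subclass information is lost as soon as `lazy()` is called, which mirrors the problem described here.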
The main reason is that `DataFrame.lazy()` is "forced" to create a new instance of `LazyFrame`, which has no notion of which class or subclass of `DataFrame` originally spawned it.

Solution
A solution to this issue is to store a pointer in the `LazyFrame` instance to the original `DataFrame` (sub)class that created it. It has the following definition:

`LazyFrame._dataframe_class` - A pointer to a `DataFrame` class or subclass which must be used whenever a `LazyFrame` must construct a new `DataFrame` instance.

By default, this class variable is hard-coded to `DataFrame` as a class property:

But any subclass of `pl.DataFrame`, for example `MyDataFrame`, must use a subclass of `pl.LazyFrame` where `LazyFrame._dataframe_class` is set to `MyDataFrame` and not `pl.DataFrame`. For this reason, each class or subclass of `pl.DataFrame` has a class variable `_lazyframe_class` set to a class or subclass of `pl.LazyFrame` with a correct value for `_lazyframe_class._dataframe_class`.

`DataFrame._lazyframe_class` - A pointer to a `LazyFrame` class or subclass which must be used whenever a `DataFrame` must construct a new `LazyFrame` instance.

This is where the logic starts to become pretty circular, but take the following (now passing) test as an example:
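As a rough sketch of how the two pointers can fit together, the snippet below models the roundtrip with toy stand-in classes rather than the real polars ones. The body of `DataFrameMetaClass` is an assumption written for illustration and is not taken from the polars source; only the attribute names `_dataframe_class` and `_lazyframe_class` and the `Lazy<Name>` naming come from the description.

```python
class LazyFrame:
    """Toy stand-in for pl.LazyFrame (hypothetical, not the polars class)."""

    # Assigned after DataFrame is defined; generated subclasses override it.
    _dataframe_class = None

    def collect(self):
        # Build whichever DataFrame (sub)class this lazy class points back to.
        return self._dataframe_class()


class DataFrameMetaClass(type):
    """Gives every DataFrame (sub)class its own paired LazyFrame subclass."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Respect a pairing that the class body declared explicitly.
        if "_lazyframe_class" not in namespace:
            cls._lazyframe_class = type(
                f"Lazy{name}", (LazyFrame,), {"_dataframe_class": cls}
            )


class DataFrame(metaclass=DataFrameMetaClass):
    """Toy stand-in for pl.DataFrame (hypothetical, not the polars class)."""

    def lazy(self):
        # Spawn the lazy class paired with type(self).
        return self._lazyframe_class()


# The plain LazyFrame defaults to the plain DataFrame.
LazyFrame._dataframe_class = DataFrame


class MyDataFrame(DataFrame):
    pass


roundtrip = MyDataFrame().lazy().collect()
print(type(roundtrip).__name__)
print(MyDataFrame._lazyframe_class.__name__)
```

The `"_lazyframe_class" not in namespace` check is a design choice of this sketch: it leaves room for a subclass to name its own lazy counterpart instead of receiving a generated one.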
The end result is that `DataFrame` subclasses are able to spawn `LazyFrame` subclasses which are able to spawn the original `DataFrame` subclasses ♻️ This magic is achieved by specifying a custom metaclass for `DataFrame` as defined here. `MyDataFrame` automatically gets a custom subclass of `LazyFrame` named `LazyMyDataFrame` which is then stored on `MyDataFrame._lazyframe_class`.

Allowing end users to extend `pl.LazyFrame`
By default, if an end user extends the functionality of `pl.DataFrame` by inheriting from it, then a custom subclass of `pl.LazyFrame` is created. The only "task" of this subclass is to return the correct `DataFrame` type when cast back into a non-lazy representation. If an end user needs to extend the functionality of `pl.LazyFrame` in addition to `pl.DataFrame`, and must therefore connect these two classes together, it can be done in the following manner:

The reason for the `@property` is to work around the circular references between these two classes.

In the future, if polars would like to officially support extension through inheritance, the class variables `_lazyframe_class` and `_dataframe_class` can be renamed to `lazyframe_class` and `dataframe_class` in order to communicate that these variables are public API. Pydantic's Model Config class and Django's Meta class could be API patterns to replicate as well, moving these configuration variables into a nested configuration class.

Type annotations
Now that the type of `MyDataFrame().lazy().collect()` is correct, we must also provide enough information to type checkers so that they understand this reality:

In order to achieve this, we must be able to annotate that a given instance of `LazyFrame` is associated with a specific subclass of `DataFrame`. The solution to this problem is generics. In the same way that you can indicate that a list `x` contains integers by writing `x: list[int]`, making the type checker understand that the return type of `x.pop()` is `int`, we should be able to annotate a lazy dataframe as `ldf: LazyFrame[MyDataFrame]` in order to specify that the return type of `ldf.collect()` is `MyDataFrame` and not `DataFrame`.

The solution is to let `LazyFrame` inherit from `typing.Generic` in the following way:

And then type annotate `DataFrame` in the following way.

Restructuring of imports
Because a custom subclass of `LazyFrame` must be created at import time of `DataFrame` (see the implementation of `DataFrameMetaClass`), we must import `polars.internals.lazy_frame.LazyFrame` from `polars.internals.frame`. I have restructured the imports such that `polars.internals` imports all the `lazy_frame` symbols through `frame` instead.
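Returning to the type annotations section, here is a minimal sketch of what a generic `LazyFrame` could look like, again with toy stand-in classes instead of the real polars ones. The type variable name `DF` and the constructor argument are assumptions made for illustration; the classes described above link to each other through the `_dataframe_class` and `_lazyframe_class` class variables instead of a constructor argument.

```python
from typing import Generic, Type, TypeVar

# Hypothetical type variable: any DataFrame or subclass of it.
DF = TypeVar("DF", bound="DataFrame")


class LazyFrame(Generic[DF]):
    """Toy stand-in for pl.LazyFrame, generic over the DataFrame type it yields."""

    def __init__(self, dataframe_class: Type[DF]) -> None:
        # Simplification: the class to construct is stored per instance.
        self._dataframe_class = dataframe_class

    def collect(self) -> DF:
        # Type checkers infer the concrete DataFrame subclass from DF.
        return self._dataframe_class()


class DataFrame:
    """Toy stand-in for pl.DataFrame (hypothetical, not the polars class)."""

    def lazy(self: DF) -> LazyFrame[DF]:
        # Annotating self with DF ties the returned LazyFrame to type(self).
        return LazyFrame(type(self))


class MyDataFrame(DataFrame):
    pass


ldf: LazyFrame[MyDataFrame] = MyDataFrame().lazy()
collected = ldf.collect()  # a type checker should see MyDataFrame here
print(type(collected).__name__)
```

Annotating `self` with the type variable is what lets a checker carry the subclass through `lazy()` and back out of `collect()` without any casts at the call site.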