-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make DataFrame and Series/Index manage the connection with each other. #1592
Conversation
hmm... I also have experienced same problem with It was resolved automatically without any fix, all I did just push an empty commit. weird. |
@itholic Thanks for letting me know. I'm investigating the reason. |
Let me re-run tests, just in case. |
for old_label, new_label in zip_longest( | ||
self._internal.column_labels, internal.column_labels | ||
): | ||
if old_label is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin, can you add some comments here? It's a bit difficult to follow here ..
self._internal_frame = internal | ||
|
||
@property | ||
def _ksers(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin, shell we add some dostrings here too?
databricks/koalas/frame.py
Outdated
): | ||
if old_label is not None: | ||
kser = self._ksers[old_label] | ||
if old_label != new_label or ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we assume the anchor is different 1. if a position of a label was changed 2. if the label was changed (?). Might be best to add comments here ..
result.name = key | ||
result = first_series( | ||
DataFrame(InternalFrame(spark_frame=sdf, index_map=None)).T | ||
).rename(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ueshin, out of curiosity, was it changed only because of the style or because ser.name = key
doesn't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At that time I was working around here, ser.name = key
didn't work.
We can revert it but I thinks .rename(key)
is more natural here.
@@ -812,12 +812,12 @@ def resolved_copy(self): | |||
def with_new_sdf( | |||
self, spark_frame: spark.DataFrame, data_columns: Optional[List[str]] = None | |||
) -> "InternalFrame": | |||
""" Copy the immutable _InternalFrame with the updates by the specified Spark DataFrame. | |||
""" Copy the immutable InternalFrame with the updates by the specified Spark DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we just change:
databricks/koalas/frame.py: :type _internal: _InternalFrame
databricks/koalas/frame.py: The given label must be verified to exist in `_InternalFrame.column_labels`.
databricks/koalas/frame.py: `self._kser_for(label)` can be used with `_InternalFrame.column_labels`:
databricks/koalas/groupby.py: # TODO: deduplicate this logic with _InternalFrame.from_pandas
too while we're here?
def _internal(self) -> InternalFrame: | ||
internal = self._kdf._internal | ||
return internal.copy( | ||
spark_column=internal.index_spark_columns[0], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe completely remove spark_column
in InternalFrame
since Series and Index don't hold them directly anymore. Maybe in a separate pr ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'm planning a clean-up later.
@@ -1276,15 +1291,15 @@ def _NotImplemented(description): | |||
|
|||
@lazy_property | |||
def _internal(self): | |||
internal = super(iLocIndexer, self)._internal | |||
internal = super(iLocIndexer, self)._internal.resolved_copy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we leave a short comment why it should use resolved_copy
?
internal.spark_frame.select(internal.spark_columns) | ||
) | ||
self._kdf._update_internal_frame( | ||
self._kdf._internal.resolved_copy, requires_same_anchor=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also leave a comment why we should use resolved_copy
?
# pser.name = None | ||
# kser.name = None | ||
# self.assertEqual(kser.name, None) | ||
# self.assert_eq(kser, pser) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm .. so this case doesn't work anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seires
without the name is not working properly now anyway.
We should revisit later.
@@ -443,7 +443,9 @@ def cache(self): | |||
""" | |||
from databricks.koalas.frame import CachedDataFrame | |||
|
|||
self._kdf._internal = self._kdf._internal.resolved_copy | |||
self._kdf._update_internal_frame( | |||
self._kdf._internal.resolved_copy, requires_same_anchor=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When do we need to set requires_same_anchor=False
?
The approach looks fine in general but had some questions on the details. |
I am going to merge this to unblock the release. |
pass | ||
|
||
@property | ||
def _kdf(self) -> DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some question here !
For IndexOpsMixin
class, we have _anchor
and _kdf
to indicates DataFrame that corresponds with Series
or Index
, and they are exactly same.
Is there a special reason that we have both property even they indicates same object ?
Making
DataFrame
andSeries
/Index
manage the connection with each other to support inplace updates.Series
andIndex
don't manage its ownInternalFrame
anymore, basically they refer the anchorDataFrame
and create theInternalFrame
only when needed.E.g.,
Here,
pser.fillna(0, inplace=True)
should also updatepdf
, whereas:Other examples:
The update of
pser
should be refected topdf
.The update of
pdf
should be reflected topser
.